Python crawler basics (5): using scrapy framework

Series article index

Python crawler basics (1): Detailed explanation of the use of urllib library
Python crawler basics (2): Using xpath and jsonpath to parse crawled data
Python crawler basics (3): Using Selenium to dynamically load web pages
Python crawler basics (4): Using the more convenient requests library
Python crawler basics (5): Using the scrapy framework

1. Introduction to scrapy

1. What is scrapy?

Scrapy is an application framework written to crawl web pages and extract structured data. The framework encapsulates everything a crawler needs: a scheduler for asynchronous request scheduling and processing, a multi-threaded downloader, selectors for parsing, Twisted for asynchronous processing, and so on. For crawling website content, it is very fast.

2. scrapy installation

# Go to the Scripts directory of your Python installation
d:
cd D:\python\Scripts
# Install (a domestic mirror can be used to speed this up)
pip install scrapy

3. Scrapy architecture composition

(1) Engine: runs automatically; you do not need to touch it. It organizes all request objects and distributes them to the downloader.
(2) Downloader: after obtaining a request object from the engine, it fetches the data.
(3) Spiders: a spider class defines how to crawl a certain website (or group of websites), including the crawling actions (for example, whether to follow links) and how to extract structured data (items) from the page content. In other words, the spider is where the crawling actions are defined and the pages are analyzed.
(4) Scheduler: it has its own scheduling rules; you do not need to touch it.
(5) Pipeline (item pipeline): the component that ultimately processes the data; it exposes interfaces for us to handle the data. After an item is collected in a spider, it is passed to the item pipeline, where several components process it in a defined order. Each item pipeline component is a Python class that implements a few simple methods: it receives an item, performs some action on it, and decides whether the item continues through the pipeline or is discarded.

The following are some typical uses of the item pipeline:
(1) cleaning HTML data, (2) validating the crawled data (checking that items contain certain fields), (3) checking for (and discarding) duplicates, (4) saving the crawled results to a database.
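As an illustration of points (2) and (3), here is a minimal sketch of a validation/deduplication pipeline. It follows the standard item-pipeline interface (process_item plus DropItem from scrapy.exceptions); the class and the 'name' field are hypothetical and not part of the projects built later in this article:

from scrapy.exceptions import DropItem

class DedupValidationPipeline:
    def __init__(self):
        # names already seen during this crawl
        self.seen = set()

    def process_item(self, item, spider):
        # (2) validate: the item must contain a non-empty "name" field
        if not item.get('name'):
            raise DropItem('missing name')
        # (3) deduplicate: discard items whose name was already processed
        if item['name'] in self.seen:
            raise DropItem('duplicate item')
        self.seen.add(item['name'])
        # let the item continue to the next pipeline
        return item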

4. How scrapy works

spiders -> scheduler -> scrapy engine -> downloader -> download from the Internet -> the downloaded data goes back through the engine to the spiders -> the spiders parse the data (e.g. with xpath) -> the parsed items are handed to the pipeline for storage


2. Basic use of scrapy

1. Create a project

Enter the project directory and open cmd:

# Create the scrapy_test project; a project name cannot start with a digit or a Chinese character
scrapy startproject scrapy_test

2. Create a crawler file

Create the crawler file inside the spiders folder.

# cd project_name\project_name\spiders
cd scrapy_test\scrapy_test\spiders

Create a crawler file. Note that there is no need to add the http:// protocol prefix:

# scrapy genspider <spider name> <page to crawl>
scrapy genspider baidu  www.baidu.com

At this time, a baidu.py file is generated in the spiders directory.
Let's take a look at the contents of baidu.py:

import scrapy

class BaiduSpider(scrapy.Spider):
    # name of the spider: the value used when running the crawler
    name = "baidu"
    # domains the spider is allowed to visit
    allowed_domains = ["www.baidu.com"]
    # starting url(s): the first address(es) to be requested
    start_urls = ["https://www.baidu.com"]

    # called after the start_urls have been requested; the response parameter is the returned object,
    # equivalent to response = urllib.request.urlopen()
    #            or response = requests.get()
    def parse(self, response):
        pass

Later, we can process the response in the parse method, which is the final crawling result.

3. (Attachment) Project composition

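The generated project follows scrapy's standard layout (as created by scrapy startproject):

scrapy_test/
    scrapy.cfg            # deployment / project configuration
    scrapy_test/
        __init__.py
        items.py          # defines the structure of the data to crawl
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines, used to store the data
        settings.py       # project settings (robots, pipelines, ...)
        spiders/
            __init__.py
            baidu.py      # the spider file created above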

4. Run the crawler code

(1) Modify baidu.py

In the parse method, customize the output:

def parse(self, response):
    print('Output works!')

In the spiders directory, execute the following command to run the crawler code:

# scrapy crawl <spider name>
scrapy crawl baidu

A lot of content is output, but our printed line is nowhere in it.

(2) robots file

When run, the console output shows that the request was filtered because of Baidu's robots protocol.

Every website has a robots.txt, a gentleman's agreement that defines what is not allowed to be crawled. Let's look at Baidu's:
https://www.baidu.com/robots.txt

In the project's settings.py file, the default is ROBOTSTXT_OBEY=True, which means following this gentleman's agreement.

We just need to comment out this line:
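In settings.py it then looks like this (the line as generated by scrapy startproject, simply commented out):

# settings.py

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True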
At this point, execute the crawler code again:

# scrapy crawl <spider name>
scrapy crawl baidu

At this time, in the command line, our customized sentence will be printed.

5. Response attributes and methods

response.xpath(xpath_expression): Select and extract data based on XPath expression.
response.css(css_expression): Select and extract data based on CSS selectors.
response.follow(url): Create a new request based on the given URL and continue processing through the callback method.
response.url: Returns the URL of the current response.
response.status: Returns the status code of the current response.
response.headers: Returns the header information of the current response.
response.body: Returns the raw binary content of the current response.
response.text: Returns the text content of the current response.
response.css('a::attr(href)').getall(): Use CSS selector to extract all matching element attribute values.
response.xpath('//a/@href').extract(): Use XPath expressions to extract all matching element attribute values
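As a brief illustration (reusing the Baidu button selector from the next section, so treat it as a sketch rather than a complete spider), these attributes and methods are typically combined inside parse like this:

def parse(self, response):
    # basic response metadata
    print(response.url, response.status)
    # extract the value attribute of the search button with XPath
    print(response.xpath('//input[@id="su"]/@value').extract_first())
    # extract all link targets on the page with a CSS selector
    for href in response.css('a::attr(href)').getall():
        print(href)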

6. Hands-on: get the text of Baidu's [百度一下] button


    def parse(self, response):
        print('=====================')
        input = response.xpath('//input[@id="su"]/@value')[0]
        print(input.extract())  # prints: 百度一下
        print('=====================')

7. Hands-on: get the Autohome car price list


import scrapy

class CarSpider(scrapy.Spider):
    name = 'car'
    # allowed_domains should contain domains only, not full URLs
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/price/brand-15.html']

    def parse(self, response):
        print('=======================')
        name_list = response.xpath('//div[@class="main-title"]/a/text()')
        price_list = response.xpath('//div[@class="main-lever"]//span/span/text()')

        for i in range(len(name_list)):
            name = name_list[i].extract()
            price = price_list[i].extract()
            print(name, price)
        print('=======================')

3. Use scrapy shell

1. What is scrapy shell?

The scrapy shell is an interactive terminal that lets you try and debug your crawling code without starting the spider. It is intended for testing the code that extracts data, but you can also use it as a normal Python shell and run any Python code in it.

The shell is used to test XPath or CSS expressions and see how they work against the crawled pages. While writing your spider, it lets you test your expressions interactively, saving you the trouble of re-running the spider after every change. Once you become familiar with the scrapy shell, you will find it plays a huge role in developing and debugging spiders.

2. Install ipython (optional)

# Go to the Scripts directory of your Python installation
d:
cd D:\python\Scripts
# Install (a domestic mirror can be used to speed this up)
pip install ipython

If IPython is installed, the scrapy shell will use it instead of the standard Python shell. The IPython terminal is more powerful, providing intelligent auto-completion, highlighted output, and other features.

Usage: type ipython directly on the command line; the new shell comes with highlighting and auto-completion.

3. Use scrapy shell

# In a Windows terminal, run: scrapy shell <domain>
# Run it directly in the command terminal (no need to enter python or ipython first); when it finishes loading, it drops you into an ipython shell
scrapy shell www.baidu.com

The response object is available directly here and can be debugged interactively.
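For example, after the shell starts you can evaluate expressions like these (ordinary Python using the response object described earlier; the commented results are the kind of values you would typically see):

response.status                                              # e.g. 200
response.url                                                 # e.g. 'http://www.baidu.com'
response.xpath('//input[@id="su"]/@value').extract_first()   # e.g. '百度一下'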

4. Hands-on: getting product data from Dangdang.com

1. Initialize project

# Create the project in a directory of your choice
scrapy startproject scrapy_dangdang

# Create the crawler file
# cd project_name\project_name\spiders
cd scrapy_dangdang\scrapy_dangdang\spiders

# scrapy genspider <spider name> <page to crawl>
# Chinese classical novels category: http://category.dangdang.com/cp01.03.32.00.00.00.html
scrapy genspider dang http://category.dangdang.com/cp01.03.32.00.00.00.html

# Run
scrapy crawl dang

2. Define item file

In the items.py file automatically generated by the project, define the data format to be crawled:

import scrapy

class ScrapyDangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Simply put: list every piece of data you want to download; the fixed syntax is scrapy.Field()

    # image
    src = scrapy.Field()
    # title
    name = scrapy.Field()
    # price
    price = scrapy.Field()

3. Crawl pictures, names, prices

Let's first work out the xpath for each of the three fields:
Image: //ul[@id="component_59"]/li/a/img/@src. Because the images are lazy-loaded, the real image URL is in the data-original attribute rather than src.
Title: //ul[@id="component_59"]/li/p[@class="name"]/a/@title
Price: //ul[@id="component_59"]/li/p[@class="price"]/span[1]/text()

import scrapy

class DangSpider(scrapy.Spider):
    name = "dang"
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["http://category.dangdang.com/cp01.03.32.00.00.00.html"]

    def parse(self, response):
        # src   = //ul[@id="component_59"]/li/a/img/@src
        # name  = //ul[@id="component_59"]/li/p[@class="name"]/a/@title
        # price = //ul[@id="component_59"]/li/p[@class="price"]/span[1]/text()
        # every selector object can call xpath() again
        li_list = response.xpath('//ul[@id="component_59"]/li')

        for li in li_list:
            # the first image uses a different attribute from the others:
            # its address is in src, while the rest keep it in data-original (lazy loading)
            src = li.xpath('./a/img/@data-original').extract_first()
            if not src:
                src = li.xpath('./a/img/@src').extract_first()

            name = li.xpath('./p[@class="name"]/a/@title').extract_first()
            price = li.xpath('./p[@class="price"]/span[1]/text()').extract_first()
            # print everything we collected
            print(src, name, price)

4. Pipeline packaging

(1) Getting familiar with yield

A function containing yield is no longer an ordinary function but a generator, which can be iterated over.

yield is a keyword similar to return: when iteration reaches a yield, the value after the yield is returned. The key point is that on the next iteration, execution resumes from the line right after the yield reached in the previous iteration.

Simply put: yield returns a value like return does, but it remembers where it returned from, and the next iteration resumes from that position (the next line).
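A tiny self-contained illustration of this behaviour (plain Python, not part of the scrapy project):

def count_up_to(n):
    i = 1
    while i <= n:
        yield i          # return i and pause here
        i += 1           # the next iteration resumes from this line

for value in count_up_to(3):
    print(value)         # prints 1, then 2, then 3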

(2) Construct the item object and hand it to the pipeline

Above we obtained the image, name and price. Now, still inside the for loop of the parse method, construct an item object and hand it to the pipeline:

# at the top of dang.py:
from scrapy_dangdang.items import ScrapyDangdangItem

            # inside the for loop of parse: build the item object
            book = ScrapyDangdangItem(src=src, name=name, price=price)

            # every time we get a book, hand it over to the pipelines
            yield book

(3) Enable pipeline in settings.py

There can be many pipelines, so the pipelines have priorities. The priority range is from 1 to 1000. The smaller the value, the higher the priority.

ITEM_PIPELINES = {
    # there can be many pipelines, so pipelines have a priority;
    # the range is 1 to 1000, and the smaller the value, the higher the priority
   "scrapy_dangdang.pipelines.ScrapyDangdangPipeline": 300,
}

(4) Edit pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

# to use a pipeline, it must be enabled in settings
class ScrapyDangdangPipeline:
    # open_spider: runs once before the spider starts
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    # process_item: runs for every item
    # item is the book object yielded by the spider
    def process_item(self, item, spider):
        # the pattern below is not recommended: it reopens the file once per item,
        # which is far too much file I/O

        # # (1) write() requires a string, not another kind of object
        # # (2) 'w' mode would reopen the file for every item and overwrite the previous content
        # with open('book.json', 'a', encoding='utf-8') as fp:
        #     fp.write(str(item))

        self.fp.write(str(item))
        self.fp.write('\n')

        return item

    # close_spider: runs after the spider has finished
    def close_spider(self, spider):
        self.fp.close()

(5) Execute and view the written json file

scrapy crawl dang

5. Use of multiple pipelines

(1) pipelines.py defines the pipeline classes

Multiple pipeline classes can be defined in pipelines.py; just write them one after another.
Each class can use the three standard methods (open_spider, process_item, close_spider) directly.

import urllib.request

class ScrapyDangdangDownloadPipeline:
    def process_item(self, item, spider):
        url = 'http:' + item.get('src')
        filename = './books/' + item.get('name') + '.jpg'
        # download the cover image (the ./books directory must already exist)
        urllib.request.urlretrieve(url=url, filename=filename)
        # must return the item so later pipelines can still receive it
        return item

(2) Enable the pipelines in settings.py

ITEM_PIPELINES = {
    # there can be many pipelines, so pipelines have a priority;
    # the range is 1 to 1000, and the smaller the value, the higher the priority
   "scrapy_dangdang.pipelines.ScrapyDangdangPipeline": 300,
   # the image-download pipeline runs after the json pipeline
   'scrapy_dangdang.pipelines.ScrapyDangdangDownloadPipeline': 301,
}

(3) Execute and view the written json file and images

First create the books directory under spiders.

# Run
scrapy crawl dang

6. Get multiple pages of data

import scrapy
from scrapy_dangdang.items import ScrapyDangdangItem


class DangSpider(scrapy.Spider):
    name = "dang"
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["http://category.dangdang.com/cp01.03.32.00.00.00.html"]

    # note: page urls look like .../pg2-cp01.03.32.00.00.00.html, so the base ends with "pg"
    base_url = 'http://category.dangdang.com/pg'
    page = 1

    def parse(self, response):
        # ... omitted: same parsing logic as above

        # the crawling logic is identical for every page, so we only need to issue a request
        # for the next page and have it call parse again:
        #   http://category.dangdang.com/pg2-cp01.03.32.00.00.00.html
        #   http://category.dangdang.com/pg3-cp01.03.32.00.00.00.html
        #   http://category.dangdang.com/pg4-cp01.03.32.00.00.00.html

        if self.page < 100:
            self.page = self.page + 1

            url = self.base_url + str(self.page) + '-cp01.03.32.00.00.00.html'

            # how to call parse again:
            # scrapy.Request is scrapy's GET request
            # url is the address to request
            # callback is the function to execute; note: no parentheses
            yield scrapy.Request(url=url, callback=self.parse)


5. Hands-on: getting data from different pages of Movie Paradise

1. Effect

From Movie Paradise (电影天堂), get the movie names from the list on the first page.
Then follow each movie's link to its detail page (the "second page") and get the poster image there.

2. Core code

mv.py core code

import scrapy

from scrapy_movie_099.items import ScrapyMovie099Item

class MvSpider(scrapy.Spider):
    name = 'mv'
    allowed_domains = ['www.dygod.net']
    start_urls = ['https://www.dygod.net/html/gndy/china/index.html']

    def parse(self, response):
        # we want the name from the first page and the image from the second page
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')

        for a in a_list:
            # get the name and the link to follow from the first page
            name = a.xpath('./text()').extract_first()
            href = a.xpath('./@href').extract_first()

            # address of the second (detail) page
            url = 'https://www.dygod.net' + href

            # request the second page and pass the name along via meta
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        # note: if you get no data, always double-check that your xpath is correct
        src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()
        # read the meta parameter passed along with the request
        name = response.meta['name']

        movie = ScrapyMovie099Item(src=src, name=name)

        yield movie

pipelines.py:

from itemadapter import ItemAdapter

class ScrapyMovie099Pipeline:

    def open_spider(self,spider):
        self.fp = open('movie.json','w',encoding='utf-8')

    def process_item(self, item, spider):

        self.fp.write(str(item))
        return item

    def close_spider(self,spider):
        self.fp.close()

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
   'scrapy_movie_099.pipelines.ScrapyMovie099Pipeline': 300,
}

6. Hands-on: using CrawlSpider to get data from dushu.com

1. Introduction to CrawlSpider

CrawlSpider inherits from scrapy.Spider and can define rules. When parsing the HTML content, it can extract specified links according to the link rules and then send requests to these links.

Therefore, if there is a need to follow up on links, that is, after crawling the web page, you need to extract the link and crawl it again, using CrawlSpider is very suitable.

Common syntax for extracting links:
A LinkExtractor is where you write the rules that select the links to follow:

scrapy.linkextractors.LinkExtractor(
	allow = (),            # regular expression: extract links matching the regex
	deny = (),             # (rarely used) regex: do NOT extract links matching it
	allow_domains = (),    # (rarely used) allowed domains
	deny_domains = (),     # (rarely used) disallowed domains
	restrict_xpaths = (),  # xpath: extract links matching the xpath rule
	restrict_css = ()      # extract links matching the CSS selector
)

# usage examples
# regex:
links = LinkExtractor(allow=r'list_23_\d+\.html')
# xpath:
links = LinkExtractor(restrict_xpaths=r'//div[@class="x"]')
# css:
links = LinkExtractor(restrict_css='.x')

# extract the links
links.extract_links(response)

2. Create a project

# Create the project: scrapy startproject <project name>
scrapy startproject readbook

# Create the crawler file
# cd project_name\project_name\spiders
cd readbook\readbook\spiders
# scrapy genspider -t crawl <spider name> <domain to crawl>
scrapy genspider -t crawl read www.dushu.com/book/1188_1.html

Notice that the generated read.py looks different from the spiders we created before.

3. Define item

class ReadbookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

4. Extract data


read.py:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from readbook.items import ReadbookItem

class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    # note: the first url must also match the rule, otherwise the first page is skipped!
    start_urls = ["https://www.dushu.com/book/1188_1.html"]

    # rules
    rules = (Rule(LinkExtractor(allow=r"/book/1188_\d+.html"), callback="parse_item", follow=True),)

    # parsing
    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//img')

        for img in img_list:
            # the image address is lazy-loaded into data-original; the book name is in alt
            src = img.xpath('./@data-original').extract_first()
            name = img.xpath('./@alt').extract_first()

            book = ReadbookItem(name=name, src=src)
            yield book

5. Define pipeline

# settings.py
ITEM_PIPELINES = {
   "readbook.pipelines.ReadbookPipeline": 300,
}
# pipelines.py
from itemadapter import ItemAdapter

class ReadbookPipeline:
    def open_spider(self,spider):
        self.fp = open('book.json','w',encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self,spider):
        self.fp.close()

6. Start

 scrapy crawl read

After execution, view the book.json file.

7. Save to MySQL

Install pymysql:

# Go to the Scripts directory of your Python installation
d:
cd D:\python\Scripts
# Install (a domestic mirror can be used to speed this up)
pip install pymysql

Add a pipeline:

# settings.py
ITEM_PIPELINES = {
   "readbook.pipelines.ReadbookPipeline": 300,
   # MysqlPipeline
   'readbook.pipelines.MysqlPipeline': 301,
}
# pay attention to the port (an integer) and the character set
DB_HOST = '192.168.1.1'
# the port number is an integer
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = '123'
DB_NAME = 'spider01'
# write 'utf8', not 'utf-8': the hyphen is not allowed here
DB_CHARSET = 'utf8'
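The MysqlPipeline below assumes that the spider01 database already contains a book table with name and src columns. A minimal, hypothetical sketch of creating it with pymysql (the column sizes are an assumption; you can equally run the CREATE TABLE statement in any MySQL client):

import pymysql

# hypothetical one-off setup script; host/user/password mirror the settings above
conn = pymysql.connect(host='192.168.1.1', port=3306, user='root',
                       password='123', charset='utf8')
cursor = conn.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS spider01 DEFAULT CHARACTER SET utf8')
cursor.execute('USE spider01')
cursor.execute('CREATE TABLE IF NOT EXISTS book ('
               ' id INT PRIMARY KEY AUTO_INCREMENT,'
               ' name VARCHAR(255),'    # book title
               ' src VARCHAR(255))')    # cover image URL
conn.commit()
cursor.close()
conn.close()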

pipelines.py:

# load the settings file
from scrapy.utils.project import get_project_settings
import pymysql


class MysqlPipeline:

    def open_spider(self, spider):
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']

        self.connect()

    def connect(self):
        self.conn = pymysql.connect(
                            host=self.host,
                            port=self.port,
                            user=self.user,
                            password=self.password,
                            db=self.name,
                            charset=self.charset
        )

        self.cursor = self.conn.cursor()


    def process_item(self, item, spider):
        # use a parameterized query so quotes in the data cannot break the SQL
        sql = 'insert into book(name, src) values(%s, %s)'
        # execute the sql statement
        self.cursor.execute(sql, (item['name'], item['src']))
        # commit
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

7. Hands-on: sending a POST request

Key code:

import scrapy

import json

class TestpostSpider(scrapy.Spider):
    name = 'testpost'
    allowed_domains = ['fanyi.baidu.com']
    # a POST request is meaningless without parameters,
    # so start_urls is useless here,
    # and so is the parse method
    # start_urls = ['https://fanyi.baidu.com/sug/']
    #
    # def parse(self, response):
    #     pass

    # start_requests is a fixed (built-in) method name
    def start_requests(self):
        url = 'https://fanyi.baidu.com/sug'

        data = {
            'kw': 'final'
        }

        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

    def parse_second(self, response):
        content = response.text
        # note: json.loads no longer accepts an encoding argument in recent Python versions
        obj = json.loads(content)

        print(obj)
