Scrapy 1.6 Crawler Framework: Handling Pagination (Part 3)

Today we practice on a site designed for beginners to learn crawling: http://books.toscrape.com/
It is a book catalog site with 50 pages by default, 20 books per page; we want to crawl the title and price of every book in one pass.


The process is actually very simple:

  1. Create a new project: scrapy startproject book
  2. cd book; tree  # inspect the project structure
  3. Create a new file book_spider.py in the spiders directory
  4. Analyze the HTML structure: first review the elements with Chrome's Developer Tools,
     then debug from the command line with scrapy shell "http://books.toscrape.com/"
     (see the sample session after this list)
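
For example, a scrapy shell session for debugging the selectors might look like this (the title and price shown are simply what the site returned at the time of writing):

$ scrapy shell "http://books.toscrape.com/"
...
>>> books = response.css('article.product_pod')
>>> len(books)
20
>>> books[0].xpath('h3/a/@title').get()
'A Light in the Attic'
>>> books[0].css('p.price_color::text').get()
'£51.77'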

Update book_spider.py as follows; the content is very simple:

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = [
        'http://books.toscrape.com/',
    ]

    def parse(self, response):
        for book in response.css('article.product_pod'):
            # selectors can be debugged with the command-line tool (scrapy shell)
            yield {
                # XPath syntax: @ATTR selects the attribute node named ATTR
                'name': book.xpath('h3/a/@title').get(),
                'price': book.css('p.price_color::text').get(),
            }
  5. Test the output: scrapy crawl books -o book.jl

The .jl format is JSON Lines: one JSON object per line of the file.
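
Since each line is an independent JSON object, the output is easy to consume afterwards; a minimal sketch, assuming the book.jl file produced above:

import json

# Read the JSON Lines output back into a list of dicts,
# parsing one object per line of the file.
with open('book.jl', encoding='utf-8') as f:
    books = [json.loads(line) for line in f]

print(len(books))   # number of books scraped
print(books[0])     # e.g. {'name': '...', 'price': '...'}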

  6. To capture everything, handle the pagination:
import scrapy


class BooksSpider(scrapy.Spider):
    # crawl command: scrapy crawl books
    name = "books"

    start_urls = [
        'http://books.toscrape.com/',
    ]

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'name': book.xpath('h3/a/@title').get(),
                'price': book.css('p.price_color::text').get(),
            }

        # check for pagination:
        # extract the link to the next page
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            # build a new Request object for it
            yield scrapy.Request(next_url, callback=self.parse)
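
As a side note, Scrapy (since 1.4) also provides response.follow(), which accepts the relative URL directly, so the explicit urljoin can be skipped; an equivalent sketch of the pagination step inside parse():

        # response.follow() joins relative URLs against the response itself
        next_url = response.css('ul.pager li.next a::attr(href)').get()
        if next_url:
            yield response.follow(next_url, callback=self.parse)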

Interpretation
urljoin is a method provided by the response object: pass in a relative address and it generates the absolute address, which is then used to build a new Request object.
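
For instance, back in the scrapy shell on the first page (the relative href below is what the site's "next" link contained at the time of writing):

>>> next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
>>> next_url
'catalogue/page-2.html'
>>> response.urljoin(next_url)
'http://books.toscrape.com/catalogue/page-2.html'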
Scrapy itself is not difficult; the real focus is your Python fundamentals.

Reproduced from: https://www.jianshu.com/p/a8a0a4bb7811

Origin: https://blog.csdn.net/weixin_34290000/article/details/91072367