Today we'll practice scraping on a site built for beginners: http://books.toscrape.com/
It's a book catalog: 50 pages by default, 20 books per page. Our goal is to crawl the title and price of every book in one run.
The process is actually very simple:
- Create a new project
scrapy startproject book
cd book; tree # inspect the project structure
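The `tree` output typically looks like this (the exact files vary slightly by Scrapy version):

```
book/
├── scrapy.cfg          # deploy configuration
└── book/               # the project's Python module
    ├── __init__.py
    ├── items.py        # item definitions
    ├── middlewares.py  # spider/downloader middlewares
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/        # spiders go here
        └── __init__.py
```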
- Create a new file in the spiders directory:
book_spider.py
- Analyze the HTML structure: first inspect the elements with Chrome's Developer Tools, then combine that with the command line:
scrapy shell "http://books.toscrape.com/"
Update book_spider.py as follows; the content is very simple:
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = [
        'http://books.toscrape.com/',
    ]

    def parse(self, response):
        for book in response.css('article.product_pod'):
            # selectors can be debugged interactively with the scrapy shell
            yield {
                # XPath syntax: @ATTR selects the attribute node named ATTR
                'name': book.xpath('h3/a/@title').get(),
                'price': book.css('p.price_color::text').get(),
            }
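To see what those two selectors are doing, here is a rough standard-library approximation (Scrapy's own selectors are backed by the parsel library, not ElementTree; the book values below are samples):

```python
import xml.etree.ElementTree as ET

# A trimmed-down <article> as it appears on books.toscrape.com (sample values)
html = (
    '<article class="product_pod">'
    '<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" '
    'title="A Light in the Attic">A Light in the ...</a></h3>'
    '<p class="price_color">\u00a351.77</p>'
    '</article>'
)

article = ET.fromstring(html)
# XPath 'h3/a/@title': find the <a> under <h3>, read its title attribute
name = article.find('h3/a').get('title')
# CSS 'p.price_color::text': find the <p>, take its text content
price = article.find('p').text
```

Note why the title comes from the attribute rather than the link text: on this site the visible link text is truncated with "...", so only @title holds the full name.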
- Test the output
scrapy crawl books -o book.jl
The .jl extension means JSON Lines: one JSON object per line.
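Each dict the spider yields becomes one line in book.jl, which makes the file trivial to read back with the standard library (the two sample records below are illustrative, not real crawl output):

```python
import io
import json

# Two sample lines in the JSON Lines (.jl) format produced by `-o book.jl`:
# one standalone JSON object per line
sample = io.StringIO(
    '{"name": "A Light in the Attic", "price": "\u00a351.77"}\n'
    '{"name": "Tipping the Velvet", "price": "\u00a353.74"}\n'
)

books = [json.loads(line) for line in sample]
```

This is why .jl is convenient for crawls: new records can be appended one at a time without rewriting a surrounding JSON array.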
- To capture everything, we have to handle pagination
class BooksSpider(scrapy.Spider):
    # run with: scrapy crawl books
    name = "books"
    start_urls = [
        'http://books.toscrape.com/',
    ]

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'name': book.xpath('h3/a/@title').get(),
                'price': book.css('p.price_color::text').get(),
            }
        # check for pagination:
        # extract the link to the next page
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            # construct a new Request object
            yield scrapy.Request(next_url, callback=self.parse)
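The parse method above keeps yielding a new Request as long as a "next" link exists, and stops on the last page. A toy simulation of that loop (the three-page "site" below is made up, and real pagination is driven by the HTML, not a dict):

```python
from urllib.parse import urljoin

# Toy "site": each absolute URL maps to the relative href of its next page,
# or None on the last page (hypothetical three-page catalogue)
next_links = {
    "http://books.toscrape.com/": "catalogue/page-2.html",
    "http://books.toscrape.com/catalogue/page-2.html": "page-3.html",
    "http://books.toscrape.com/catalogue/page-3.html": None,
}

visited = []
url = "http://books.toscrape.com/"
while url is not None:
    visited.append(url)
    rel = next_links[url]
    # mirrors `response.urljoin(next_url)` followed by a new Request
    url = urljoin(url, rel) if rel else None
```

The loop visits all three pages and terminates once no next link is found, just like the spider.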
Interpretation
urljoin is a method provided by the response object: pass it a relative URL and it returns the absolute URL, which we then use to build a new Request object.
Scrapy itself is not difficult; the key is a solid Python foundation.
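Scrapy's response.urljoin(href) resolves href against the response's own URL, the same resolution the standard library's urllib.parse.urljoin performs (the page URLs below are samples):

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com/catalogue/page-2.html"

# a page-relative href resolves against the base's directory
page3 = urljoin(base, "page-3.html")
# -> http://books.toscrape.com/catalogue/page-3.html

# a root-relative href resolves against the site root
page4 = urljoin(base, "/catalogue/page-4.html")
# -> http://books.toscrape.com/catalogue/page-4.html
```

This is why the spider must call urljoin before building the Request: the href in the pager is relative, and a Request needs an absolute URL.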
Reproduced from: https://www.jianshu.com/p/a8a0a4bb7811