Scrapy Learning Notes (1)

Install Scrapy with pip

Command: pip install Scrapy
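If the installation succeeded, the scrapy command-line tool is now available; a quick way to check is its version command:

Command: scrapy version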

Create a Scrapy project

Command: scrapy startproject tutorial (where tutorial is the project name)

Scrapy project directory structure

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Our first Spider

We need to put our first Spider in the spiders folder under the project directory; following the official tutorial, save it as tutorial/spiders/quotes_spider.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    # name identifies the Spider; it must be unique within a project
    name = "quotes"

    def start_requests(self):
        # the initial URLs this Spider will crawl from
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # schedule a request for each URL; the response is handled by self.parse
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # save each downloaded page as a local HTML file
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
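To run the Spider, switch to the project's top-level directory (the one containing scrapy.cfg) and invoke the crawl command with the Spider's name:

Command: scrapy crawl quotes

Scrapy will request the two URLs and write quotes-1.html and quotes-2.html into the current directory.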

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:


name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders.


start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
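Incidentally, you do not have to implement start_requests() yourself: if a Spider defines a start_urls class attribute, the inherited default start_requests() turns those URLs into Requests and sends each response to parse(). A minimal sketch of the same spider using that shortcut:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # the inherited start_requests() builds Requests from these URLs
    # and routes the responses to parse()
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        self.log('Downloaded %s' % response.url)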


parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
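Those helpful methods include CSS and XPath selectors. An easy way to try them is scrapy shell, which fetches a page and drops you into an interactive session with the response object ready; the output shown below assumes the quotes.toscrape.com example site:

Command: scrapy shell 'http://quotes.toscrape.com/page/1/'

>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

(.get() returns the first match as a string; on older Scrapy versions the equivalent method is extract_first().)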


The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
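As a sketch of that pattern (the selectors div.quote, span.text, small.author and li.next come from the markup of quotes.toscrape.com; adapt them to your own target pages):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # extract the scraped data as plain dicts
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # find the next-page URL and create a new Request from it;
        # response.follow resolves the relative link for us
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running it with scrapy crawl quotes -o quotes.json collects the yielded dicts into a JSON file via Scrapy's feed exports.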

Reposted from blog.csdn.net/qq_34953652/article/details/82862156