Scrapy: the Spider class

Spider

The Spider class defines how a website (or a group of websites) is crawled. This includes the crawling actions (for example, whether to follow links) and how to extract structured data, i.e. Items, from page content. In other words, the Spider is where you define the crawling behavior and how pages are parsed.

class scrapy.Spider is the most basic spider class; every spider you write must inherit from it.

The main methods used, and the order in which they are called:

  • __init__(): initializes the spider's name and the start_urls list
  • start_requests(): calls make_requests_from_url() to generate Request objects, which Scrapy downloads and returns as responses
  • parse(): parses the response and returns Items or Requests (with a callback function specified)

Items are passed to the Item Pipeline for persistence, while Requests are handed to Scrapy for downloading and then processed by the specified callback function (parse() by default). This loop continues until all the data has been processed.
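
To make this flow concrete, here is a minimal sketch of a spider; the spider name, the site quotes.toscrape.com, and its selectors are assumed placeholders, not part of the original text. Scrapy reads name and start_urls, start_requests() turns each URL into a Request, and parse() yields Items and follow-up Requests.

import scrapy


class MinimalSpider(scrapy.Spider):
    # name and start_urls are picked up by __init__() / start_requests()
    name = 'minimal'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yielded dicts/Items are passed on to the Item Pipeline
        for quote in response.xpath("//div[@class='quote']"):
            yield {'text': quote.xpath(".//span[@class='text']/text()").extract_first()}

        # Yielded Requests go back to the scheduler; parse() is the default callback
        next_page = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)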

Source code reference

# Base class for all spiders; user-defined spiders must inherit from this class
class Spider(object_ref):

    # name defines the spider's name (string). Scrapy uses the name to locate (and instantiate)
    # the spider, so it must be unique.
    # name is the spider's most important attribute, and it is required.
    # Common practice is to name the spider after the site (domain), with or without the suffix.
    # For example, a spider that crawls mywebsite.com would usually be named mywebsite
    name = None

    # Initialization: set the spider's name and start_urls
    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # If the spider has no name, raise an error and abort
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)

        # Python objects store member information in the built-in __dict__
        self.__dict__.update(kwargs)

        # List of URLs. When no particular URLs are specified, the spider starts crawling from
        # this list, so the first pages fetched will come from it. Subsequent URLs are extracted
        # from the fetched data.
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    # Print log information while Scrapy runs
    def log(self, message, level=log.DEBUG, **kw):
        log.msg(message, spider=self, level=level, **kw)

    # Check that the spider is not already bound to a crawler, then bind it
    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "Spider already bounded to %s" % crawler
        self._crawler = crawler

    @property
    def crawler(self):
        assert hasattr(self, '_crawler'), "Spider not bounded to any crawler"
        return self._crawler

    @property
    def settings(self):
        return self.crawler.settings

    # This method reads the addresses in start_urls and generates a Request object for each,
    # which Scrapy downloads and returns as a Response
    # This method is called only once
    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    # Called by start_requests(); actually builds the Request objects.
    # The Request's default callback is parse(), and requests are sent as GET
    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    # Default callback for Request objects; handles the returned response.
    # Generates Item or Request objects. Users must implement this method
    def parse(self, response):
        raise NotImplementedError

    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)

    def __str__(self):
        return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))

    __repr__ = __str__

Main properties and methods

  • name

    A string defining the name of the spider.

    For example, a spider that crawls mywebsite.com is typically named mywebsite.

  • allowed_domains

    An optional list of domains the spider is allowed to crawl.

  • start_urls

    A tuple/list of initial URLs. When no particular URLs are specified, the spider starts crawling from this list.

  • start_requests(self)

    This method must return an iterable containing the first Requests the spider will crawl (the default implementation builds them from the URLs in start_urls).

    It is called when the spider starts crawling and no particular URLs are specified. A short sketch of overriding it is shown after this list.

  • parse(self, response)

    The default callback for Request objects, used when the request does not specify one. It processes the response returned for the page and generates Item or Request objects.

  • log(self, message[, level, component])

    Logs a message using the scrapy.log.msg() method. See logging for more information.
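
As mentioned under start_requests(self), the method can be overridden when the first request should not simply be a GET of start_urls. Below is a sketch assuming a hypothetical login endpoint; the URL, form fields, and spider name are placeholders, not part of the original article.

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_example'

    # Override start_requests() instead of relying on start_urls,
    # e.g. to start the crawl with a POST login request.
    def start_requests(self):
        return [scrapy.FormRequest(
            'http://example.com/login',                   # placeholder URL
            formdata={'user': 'john', 'pass': 'secret'},  # placeholder credentials
            callback=self.after_login,
        )]

    def after_login(self, response):
        # Continue crawling once logged in
        yield scrapy.Request('http://example.com/profile', callback=self.parse)

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').extract_first()}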

Example

  • Create a new spider:
scrapy genspider loaderman "cnblogs.com"
  • Write items.py

Define the fields to extract:

 

class LoadermanItem(scrapy.Item):
    title = scrapy.Field()
    detailUrl = scrapy.Field()
    content = scrapy.Field()
    date = scrapy.Field()
  • Write LoadermanSpider.py
# -*- coding: utf-8 -*-
import scrapy

from scrapyDemo.items import LoadermanItem


class LoadermanSpider(scrapy.Spider):
    name = 'loaderman'
    allowed_domains = ['cnblogs.com']  # domains only, without the scheme
    start_urls = ['http://www.cnblogs.com/loaderman']

    def parse(self, response):
        # filename = "loaderman.html"
        # open(filename, 'w').write(response.body)
        xpathList = response.xpath("//div[@class='post']")
        # items= []
        for each in xpathList:
            # Wrap the extracted data in a `LoadermanItem` object

            item = LoadermanItem()

            # extract() always returns unicode strings
            title = each.xpath(".//h2/a[@class='postTitle2']/text()").extract()
            detailUrl = each.xpath(".//a[@class='postTitle2']/@href").extract()
            content = each.xpath(".//div[@class='c_b_p_desc']/text()").extract()
            date = each.xpath(".//p[@class='postfoot']/text()").extract()
            # xpath() returns a list containing one element

            item['title'] = title[0]
            item['detailUrl'] = detailUrl[0]
            item['content'] = content[0]
            item['date'] = date[0]
            # items.append(item)
            # Hand the extracted data to the pipelines
            yield item

        # Return the data directly, without going through the pipeline
        # return items
  • Write pipelines.py
     
      
    import json
    
    class LoadermanPipeline(object):
    
        def __init__(self):
            self.file = open('loaderman.json', 'w')
            # self.file.write("[")
    
        def process_item(self, item, spider):
    
            jsontext = json.dumps(dict(item), ensure_ascii=False) + " ,\n"
    
            self.file.write(jsontext.encode("utf-8"))
    
            return item
    
        def close_spider(self, spider):
            # self.file.write("]")
            self.file.close()
  • Set ITEM_PIPELINES in settings.py
ITEM_PIPELINES = {

    'scrapyDemo.pipelines.LoadermanPipeline': 300,
}
  • Run the spider: scrapy crawl loaderman

 How the parse() method works:

1. Because parse() uses yield instead of return, it is treated as a generator. Scrapy pulls the results produced by parse() one by one and checks the type of each result;
2. If the result is a Request, it is added to the crawl queue; if it is an Item, it is handed to the pipeline for processing; any other type raises an error.
3. When Scrapy takes the Requests from the first batch, it does not send them immediately; it just puts them in the queue and keeps pulling from the generator;
4. Once the Requests of the first batch are exhausted, it fetches the Items of the second batch; each Item retrieved is sent to the corresponding pipeline for processing;
5. parse() is assigned to the Request as its callback; scrapy.Request(url, callback=self.parse) specifies that parse() handles those requests;
6. Request objects go through the scheduler, are executed to produce scrapy.http.Response objects, and are sent back to parse(), until the scheduler has no Requests left (a recursive process);
7. Once everything has been consumed, parse() finishes, and the engine performs the corresponding operations based on the queue and the pipelines;
8. Before extracting the items of each page, the program first finishes processing all the requests already in the request queue, and only then extracts the items.
9. The Scrapy engine and scheduler take care of all of this.
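
To illustrate points 1–6, here is a sketch of a parse() that yields both Items and follow-up Requests from the same generator; the spider name and the "next page" selector are assumed placeholders, not taken from the original example.

import scrapy

from scrapyDemo.items import LoadermanItem


class ListingSpider(scrapy.Spider):
    name = 'listing_example'
    start_urls = ['http://www.cnblogs.com/loaderman']

    def parse(self, response):
        # Each yielded Item is routed to the configured pipelines ...
        for post in response.xpath("//div[@class='post']"):
            item = LoadermanItem()
            item['title'] = post.xpath(".//h2/a[@class='postTitle2']/text()").extract_first()
            item['detailUrl'] = post.xpath(".//a[@class='postTitle2']/@href").extract_first()
            yield item

        # ... while each yielded Request is queued by the scheduler and
        # downloaded later, with parse() handling its response in turn.
        next_page = response.xpath("//div[@id='nav_next_page']/a/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)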

Origin www.cnblogs.com/loaderman/p/11890229.html