Starting a crawler from the Scrapy API

Scrapy provides not only the scrapy crawl <spider> command for starting a crawler, but also an API for starting crawlers from your own scripts.

Scrapy is built on top of Twisted, an asynchronous networking library, so it has to run inside the Twisted reactor.

Crawlers can be run through two APIs: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.
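CrawlerProcess is the simpler of the two: it starts and stops the Twisted reactor for you. CrawlerRunner leaves the reactor under your control, which is the recommended choice when your application already uses Twisted and you want Scrapy to run in the same reactor.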

scrapy.crawler.CrawlerProcess

This class starts the Twisted reactor for you, configures logging, and shuts the reactor down automatically when crawling finishes. It is the class used internally by all Scrapy commands.

Running a single crawler

import scrapy

class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        # issue the first request explicitly and route it to parse()
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')


if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(QiushispiderSpider)         # the spider class; the name 'qiushiSpider' also works, see below
    process.start()

process.crawl() accepts either the spider class (QiushispiderSpider) or the spider's name as a string ('qiushiSpider').
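A minimal sketch of the two call forms, assuming the script lives inside a Scrapy project so that QiushispiderSpider is importable and the spider loader can resolve the name:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# either form works here; the name string is resolved through the
# spider loader, which needs SPIDER_MODULES from the project settings
process.crawl(QiushispiderSpider)
# process.crawl('qiushiSpider')

process.start()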

When CrawlerProcess is instantiated with no arguments, as in the first example, it does not load the crawler project's settings.py:

2019-05-27 14:39:57 [scrapy.crawler] INFO: Overridden settings: {}

Loading the project settings

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl(QiushispiderSpider)         # or process.crawl('qiushiSpider')
process.start()
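get_project_settings() locates the project's settings.py through the scrapy.cfg file (or the SCRAPY_SETTINGS_MODULE environment variable), so the script should be run from inside the project directory for the settings to be found.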

Running multiple crawlers

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # both spiders run concurrently in the same process
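crawl() also forwards extra arguments to the spider's constructor, so per-run parameters can be injected from the script; a small sketch (the category argument here is hypothetical):

process = CrawlerProcess()
# keyword arguments are passed to the spider's __init__,
# just like -a category=books on the scrapy crawl command line
process.crawl(MySpider1, category='books')
process.start()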

scrapy.crawler.CrawlerRunner

  1. Gives finer control over the crawling process.
  2. You start the Twisted reactor explicitly and stop it explicitly.
  3. You must add a callback to the Deferred returned by CrawlerRunner.crawl() so the reactor is stopped when crawling finishes.

Running a single crawler

import scrapy

class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        # issue the first request explicitly and route it to parse()
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')


if __name__ == '__main__':
    # test CrawlerRunner
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging({'LOG_FORMAT':'%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(QiushispiderSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until the crawling is finished

configure_logging() sets up Scrapy's log output; here it is passed a custom LOG_FORMAT.

addBoth() attaches a callback that fires whether the crawl succeeds or fails; here it stops the Twisted reactor once crawling is done.
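A minimal sketch of the same idea with a named callback, in case you want to inspect the outcome before stopping the reactor (runner and reactor as in the example above; the finished name is ours):

def finished(result):
    # addBoth delivers either the crawl result or a twisted Failure here
    print('crawl finished:', result)
    reactor.stop()

d = runner.crawl(QiushispiderSpider)
d.addBoth(finished)
reactor.run()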

Running multiple crawlers

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
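runner.join() returns a single Deferred that fires only after every crawl started on the runner has finished, so one addBoth callback is enough to stop the reactor for both spiders.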

The crawls can also be chained with Twisted's inlineCallbacks so that the spiders run sequentially, one after another:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
