Scrapy not only provides the scrapy crawl <spider> command to start a crawler from the command line, it also provides an API for starting crawlers from your own scripts.
Scrapy is built on top of Twisted, an asynchronous networking library, so it must run inside the Twisted reactor.
Crawlers can be run through two APIs: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner
scrapy.crawler.CrawlerProcess
This class starts twisted.reactor for you, configures logging, and shuts the reactor down automatically when crawling finishes. It is the class used internally by all Scrapy commands.
Run a single crawler example
import scrapy

class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')
if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(QiushispiderSpider)  # or 'qiushiSpider'
    process.start()
The argument to process.crawl() can be either the spider name 'qiushiSpider' (resolved through the project's spider loader) or the spider class QiushispiderSpider itself.
Called with no arguments, CrawlerProcess() does not load the project's settings file, as the log shows:
2019-05-27 14:39:57 [scrapy.crawler] INFO: Overridden settings: {}
To use the project's settings, pass get_project_settings() to CrawlerProcess:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl(QiushispiderSpider) # 'qiushiSpider'
process.start()
run multiple crawlers
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
scrapy.crawler.CrawlerRunner
- Gives finer control over the crawl lifecycle
- You start and stop twisted.reactor explicitly yourself
- You must add a callback to the Deferred returned by CrawlerRunner.crawl() to stop the reactor when crawling finishes
Run a single crawler example
import scrapy

class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')
if __name__ == '__main__':
    # test CrawlerRunner
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(QiushispiderSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
configure_logging() sets the log output format; unlike CrawlerProcess, CrawlerRunner does not configure logging for you.
addBoth() adds a callback to the Deferred returned by crawl(); it fires on success or failure and stops the reactor.
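To see why addBoth (rather than addCallback) is used here, a conceptual emulation may help. MiniDeferred below is NOT Twisted's real twisted.internet.defer.Deferred, just a sketch of the idea: addBoth registers a function that runs regardless of whether the crawl succeeded or failed, much like a finally clause:

```python
# Conceptual emulation of Deferred.addBoth (a sketch, not Twisted itself).
class MiniDeferred:
    def __init__(self):
        self._callbacks = []

    def addBoth(self, fn):
        # fn will run on success AND on failure (like try/finally)
        self._callbacks.append(fn)
        return self

    def fire(self, result):
        # deliver the result (or an error) through the callback chain
        for fn in self._callbacks:
            result = fn(result)
        return result

events = []
d = MiniDeferred()
# analogous to d.addBoth(lambda _: reactor.stop())
d.addBoth(lambda res: events.append(('stop_reactor', res)) or res)

d.fire('crawl finished')  # success path; an Exception would take the same route
print(events)
```

Because the stop callback runs on both paths, the reactor is shut down even when the spider raises, so the script never hangs.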
run multiple crawlers
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...
configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
The spiders can also be run one after another, rather than concurrently, by chaining the Deferreds with inlineCallbacks:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...
configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()
crawl()
reactor.run()  # the script will block here until the last crawl call is finished
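What makes this sequential is the way @defer.inlineCallbacks drives the generator: each yield suspends crawl() until the yielded Deferred fires, then resumes it with the result. The sketch below emulates that driving loop in plain Python (run_sequentially and fake_crawl are hypothetical names, not Twisted APIs) to show that MySpider2 only starts after MySpider1 finishes:

```python
# Sketch of the idea behind @defer.inlineCallbacks (not Twisted itself):
# a driver advances the generator, sending each result back in, so the
# yielded steps run strictly one after another.
def run_sequentially(gen_fn):
    gen = gen_fn()
    result = None
    try:
        while True:
            # resume the generator at its last yield, handing back the
            # previous step's result (as if its Deferred had fired)
            result = gen.send(result)
    except StopIteration:
        return result

order = []

def fake_crawl(name):
    # stands in for runner.crawl(SpiderClass)
    order.append(name)
    return f'{name} done'

def crawl():
    yield fake_crawl('MySpider1')
    yield fake_crawl('MySpider2')  # only reached after the first yield resumes

run_sequentially(crawl)
print(order)
```

In real Twisted the driver resumes the generator asynchronously when each Deferred fires, but the ordering guarantee is the same.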