1. Increase concurrency
Concurrency is the number of requests processed in parallel. Scrapy has both a global limit and a local (per-site) limit.
Scrapy's default global concurrency limit is not suited to crawling a large number of different websites, so you will want to raise it. How much depends on how much CPU your crawler can use; 100 is a reasonable starting point, but the best approach is to run some tests and measure the relationship between CPU usage and the number of concurrent requests. For optimal performance, pick the concurrency at which CPU usage sits around 80-90%.
In settings.py, write
CONCURRENT_REQUESTS = 100
(Scrapy's default is 16).
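A minimal settings.py sketch for raising concurrency. The per-domain value is illustrative, not a recommendation; tune both numbers against your own CPU measurements:

```python
# settings.py -- illustrative values; tune against your own CPU usage
CONCURRENT_REQUESTS = 100            # global limit (Scrapy default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-site limit (Scrapy default: 8)
```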
2. Increase the thread pool
Scrapy performs DNS lookups through a thread pool; increasing its size generally improves Scrapy's performance as well.
REACTOR_THREADPOOL_MAXSIZE = 20
3. Reduce the log level
For a broad crawl you generally only care about the crawl rate and any errors encountered, and Scrapy reports those at the INFO log level. To reduce CPU usage (and log storage requirements), do not use the DEBUG log level when running a broad crawl in production; using DEBUG during development is fine. In settings.py:
LOG_LEVEL = 'INFO'
4. Disable cookies
Disable cookies unless you really need them. They are unnecessary for broad crawls (search engines ignore cookies), and disabling them reduces CPU usage and the memory Scrapy spends tracking them, which improves performance.
COOKIES_ENABLED = False
5. Disable retries
Retrying failed HTTP requests slows the crawl down, especially when a site responds slowly (or not at all): a request that times out gets retried several times, which is unnecessary and ties up capacity the crawler could spend on other sites.
RETRY_ENABLED = False
6. Reduce the download timeout
Unless you are crawling from a very slow connection (which is usually not the case for a broad crawl), reduce the download timeout so that stuck requests are abandoned quickly, freeing capacity to handle other sites.
DOWNLOAD_TIMEOUT = 15
This sets the download timeout to 15 seconds (the default is 180).
7. Disable redirects
Consider disabling redirects unless you are interested in following them. In a broad crawl, the usual practice is to save the redirect target and crawl it in a later batch. This keeps the number of requests in each batch constant; otherwise redirect loops can make the crawler spend too many resources on a single site.
REDIRECT_ENABLED = False
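The tips in sections 2-7 can be collected into a single settings.py fragment; this is a sketch using the values suggested above, which you should adjust for your own crawl:

```python
# settings.py -- broad-crawl tuning collected from the tips above (a sketch)
REACTOR_THREADPOOL_MAXSIZE = 20  # larger thread pool for DNS resolution
LOG_LEVEL = 'INFO'               # DEBUG is too chatty for production crawls
COOKIES_ENABLED = False          # broad crawls rarely need cookies
RETRY_ENABLED = False            # don't retry failed requests
DOWNLOAD_TIMEOUT = 15            # give up on slow connections quickly
REDIRECT_ENABLED = False         # record redirect targets instead of following
```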
8. Set a download delay
DOWNLOAD_DELAY = 3
This sets a delay between downloads, which helps the crawler avoid being detected and banned.
9. Debugging tools
9.1 Command-line debugging
- scrapy shell <url> downloads a page and opens it in an interactive shell. This may not work for pages that require particular request headers, but it is fine for ordinary pages.
- scrapy view <url> shows a page as Scrapy sees it, which is useful for spotting dynamically loaded content: if a page relies on dynamic loading, the copy opened by this command will be incomplete, and whatever is missing was loaded dynamically.
9.2 Debugging in code
from scrapy.shell import inspect_response

def parse(self, response):
    # When execution reaches this line, Scrapy drops into an interactive
    # shell in the terminal where you can inspect `response`; place this
    # call wherever you need a breakpoint.
    inspect_response(response, self)
10. Automatically throttle the crawler
Scrapy ships with an extension, AutoThrottle, that automatically adjusts the load placed on the target server: an algorithm works out the best download delay and related settings. See the Scrapy documentation for details.
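The throttling rule described in the Scrapy documentation can be sketched as a pure function. This is a simplified illustration of the algorithm, not the extension's actual code: the next delay moves halfway toward latency / target_concurrency, error responses are not allowed to shrink the delay, and the result is clamped between the configured minimum and maximum:

```python
def next_delay(prev_delay, latency, ok,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """One step of AutoThrottle's adjustment rule (simplified sketch).

    prev_delay: current download delay in seconds
    latency:    observed latency of the last response
    ok:         True for a successful (200) response
    """
    target = latency / target_concurrency
    new = (prev_delay + target) / 2.0
    # latencies of failed responses may not decrease the delay
    if not ok and new < prev_delay:
        new = prev_delay
    # clamp between the configured floor and AUTOTHROTTLE_MAX_DELAY
    return min(max(min_delay, new), max_delay)
```

For example, with a fast 1-second response the delay drifts down (5.0 s becomes 3.0 s), while a failed response leaves it unchanged.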
The relevant settings are listed below; see the documentation for what each one does.
AUTOTHROTTLE_ENABLED
AUTOTHROTTLE_START_DELAY
AUTOTHROTTLE_MAX_DELAY
AUTOTHROTTLE_TARGET_CONCURRENCY
AUTOTHROTTLE_DEBUG
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
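A settings.py sketch enabling the extension; the delay values here are illustrative assumptions, not recommendations:

```python
# settings.py -- AutoThrottle sketch; tune the delays for your targets
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # ceiling for the adjusted delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # avg parallel requests per remote site
AUTOTHROTTLE_DEBUG = True              # log throttling stats per response
```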
11. Pause and resume the crawler
The hardest thing for beginners is handling interruptions: when a crawl dies halfway through because of an error, it normally cannot pick up where it left off. Scrapy can, however, persist the crawl state, so that a spider restarted after an interruption resumes from where it stopped. Persistence handles this well, but note that if you use cookies to keep a simulated login session alive, those cookies may expire while the crawl is paused.
You only need to set JOBDIR = 'file_name' in settings.py, where file_name is a directory in which the crawl state is stored. Note that the directory must not be shared: it can hold the state of a single spider run only. If you do not want to resume from where you left off, simply delete the directory.
Alternatively, you can supply it when launching the spider from the terminal:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
To pause, press Ctrl+C once (pressing it a second time forces an unclean shutdown that cannot be resumed); to resume, run the same command again.