1. Increase concurrency
Concurrency is the number of requests processed in parallel. Scrapy has both a global limit and a local (per-site) limit.
Scrapy's default global concurrency limit is not suited to crawling a large number of different websites, so you will want to raise it. How much depends on how much CPU your crawler can use; 100 is a reasonable starting point, but the best approach is to run some tests and measure the relationship between CPU usage and the number of concurrent requests. For optimal performance, pick the concurrency at which CPU usage sits around 80-90%.
In settings.py, write
CONCURRENT_REQUESTS = 100
(Scrapy's default is 16).
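A minimal settings.py sketch for raising concurrency. The per-domain value is illustrative, not a recommendation; tune both numbers against your own CPU measurements:

```python
# settings.py -- illustrative values; tune against your own CPU usage
CONCURRENT_REQUESTS = 100            # global limit (Scrapy default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-site limit (Scrapy default: 8)
```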
2. Increase the thread pool
Scrapy performs DNS lookups through a thread pool; increasing its size generally improves Scrapy's performance as well.
REACTOR_THREADPOOL_MAXSIZE = 20
3. Reduce the log level
For a broad crawl you generally only care about the crawl rate and any errors encountered, and Scrapy reports those at the INFO log level. To reduce CPU usage (and log storage requirements), do not use the DEBUG log level when running a broad crawl in production; using DEBUG during development is fine. In settings.py:
LOG_LEVEL = 'INFO'
4. Disable cookies
Disable cookies unless you really need them. They are unnecessary for broad crawls (search engines ignore cookies), and disabling them reduces CPU usage and the memory Scrapy spends tracking them, which improves performance.
COOKIES_ENABLED = False
5. Disable retries
Retrying failed HTTP requests slows the crawl down, especially when a site responds slowly (or not at all): a request that times out gets retried several times, which is unnecessary and ties up capacity the crawler could spend on other sites.
RETRY_ENABLED = False
6. Reduce the download timeout
Unless you are crawling from a very slow connection (which is usually not the case for a broad crawl), reduce the download timeout so that stuck requests are abandoned quickly, freeing capacity to handle other sites.
DOWNLOAD_TIMEOUT = 15
This sets the download timeout to 15 seconds (the default is 180).
7. Disable redirects
Consider disabling redirects unless you are interested in following them. In a broad crawl, the usual practice is to save the redirect target and crawl it in a later batch. This keeps the number of requests in each batch constant; otherwise redirect loops can make the crawler spend too many resources on a single site.
REDIRECT_ENABLED = False
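The tips in sections 2-7 can be collected into a single settings.py fragment; this is a sketch using the values suggested above, which you should adjust for your own crawl:

```python
# settings.py -- broad-crawl tuning collected from the tips above (a sketch)
REACTOR_THREADPOOL_MAXSIZE = 20  # larger thread pool for DNS resolution
LOG_LEVEL = 'INFO'               # DEBUG is too chatty for production crawls
COOKIES_ENABLED = False          # broad crawls rarely need cookies
RETRY_ENABLED = False            # don't retry failed requests
DOWNLOAD_TIMEOUT = 15            # give up on slow connections quickly
REDIRECT_ENABLED = False         # record redirect targets instead of following
```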
8. Set a download delay
DOWNLOAD_DELAY = 3
This sets a delay between downloads, which helps the crawler avoid being detected and banned.
9. Debugging tools
9.1 Command-line debugging
- scrapy shell <url> downloads a page and opens it in an interactive shell. This may not work for pages that require particular request headers, but it is fine for ordinary pages.
- scrapy view <url> shows a page as Scrapy sees it, which is useful for spotting dynamically loaded content: if a page relies on dynamic loading, the copy opened by this command will be incomplete, and whatever is missing was loaded dynamically.
9.2 Debugging in code
from scrapy.shell import inspect_response

def parse(self, response):
    # When execution reaches this line, Scrapy drops into an interactive
    # shell in the terminal where you can inspect `response`; place this
    # call wherever you need a breakpoint.
    inspect_response(response, self)
10. Automatically throttle the crawler
Scrapy ships with an extension, AutoThrottle, that automatically adjusts the load placed on the target server: an algorithm works out the best download delay and related settings. See the Scrapy documentation for details.
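The throttling rule described in the Scrapy documentation can be sketched as a pure function. This is a simplified illustration of the algorithm, not the extension's actual code: the next delay moves halfway toward latency / target_concurrency, error responses are not allowed to shrink the delay, and the result is clamped between the configured minimum and maximum:

```python
def next_delay(prev_delay, latency, ok,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """One step of AutoThrottle's adjustment rule (simplified sketch).

    prev_delay: current download delay in seconds
    latency:    observed latency of the last response
    ok:         True for a successful (200) response
    """
    target = latency / target_concurrency
    new = (prev_delay + target) / 2.0
    # latencies of failed responses may not decrease the delay
    if not ok and new < prev_delay:
        new = prev_delay
    # clamp between the configured floor and AUTOTHROTTLE_MAX_DELAY
    return min(max(min_delay, new), max_delay)
```

For example, with a fast 1-second response the delay drifts down (5.0 s becomes 3.0 s), while a failed response leaves it unchanged.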
The relevant settings are listed below; see the documentation for what each one does.
AUTOTHROTTLE_ENABLED
AUTOTHROTTLE_START_DELAY
AUTOTHROTTLE_MAX_DELAY
AUTOTHROTTLE_TARGET_CONCURRENCY
AUTOTHROTTLE_DEBUG
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
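A settings.py sketch enabling the extension; the delay values here are illustrative assumptions, not recommendations:

```python
# settings.py -- AutoThrottle sketch; tune the delays for your targets
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # ceiling for the adjusted delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # avg parallel requests per remote site
AUTOTHROTTLE_DEBUG = True              # log throttling stats per response
```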
11. Pause and resume the crawler
The hardest thing for beginners is handling interruptions: when a crawl dies halfway through because of an error, it normally cannot pick up where it left off. Scrapy can, however, persist the crawl state, so that a spider restarted after an interruption resumes from where it stopped. Persistence handles this well, but note that if you use cookies to keep a simulated login session alive, those cookies may expire while the crawl is paused.
You only need to set JOBDIR = 'file_name' in settings.py, where file_name is a directory in which the crawl state is stored. Note that the directory must not be shared: it can hold the state of a single spider run only. If you do not want to resume from where you left off, simply delete the directory.
Alternatively, you can supply it when launching the spider from the terminal:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
To pause, press Ctrl+C once (pressing it a second time forces an unclean shutdown that cannot be resumed); to resume, run the same command again.