[Scrapy framework: log management and crawling efficiency] --2019-08-09 10:11:34

Original: http://106.13.73.98/__/140/

Log levels

  • ERROR : error messages
  • WARNING : warnings
  • INFO : general information
  • DEBUG : debugging information

Log Management

In settings.py, add the following two settings to configure logging:

# Specify the log level
LOG_LEVEL = 'ERROR'

# Specify the file that log output is written to
LOG_FILE = 'log.txt'
# If a log file is specified, log messages are no longer printed to the terminal

Crawling efficiency


1. Increase concurrency
By default Scrapy allows 32 concurrent requests; this can be increased as appropriate.
Set CONCURRENT_REQUESTS in the settings.py configuration file to specify the number of concurrent requests.
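
For example, in settings.py (the value 100 is only an illustrative choice, not a recommendation):

# Raise the maximum number of concurrent requests handled by the downloader
CONCURRENT_REQUESTS = 100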


2. Lower the log level
Running Scrapy produces a large amount of log output; setting the log level to INFO or ERROR reduces CPU usage.
Set LOG_LEVEL in the settings.py configuration file to specify the log level.
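
For example, in settings.py (INFO is only an illustrative choice; ERROR produces even less output):

# Only emit log messages at INFO level or above
LOG_LEVEL = 'INFO'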


3. Disable cookies
Unless cookies are really needed, they can be disabled to reduce CPU usage and improve crawling efficiency; they are enabled by default.
Set COOKIES_ENABLED in the settings.py configuration file to enable or disable cookies.
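
For example, in settings.py:

# Do not send or process cookies (only do this if the target site does not need them)
COOKIES_ENABLED = False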


4. Disable retries
Retrying failed HTTP requests slows crawling down, so retries can be disabled.
Set RETRY_ENABLED in the settings.py configuration file to enable or disable retries.
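
For example, in settings.py:

# Do not retry failed requests
RETRY_ENABLED = False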


5. Reduce the download timeout
When crawling very slow links, reducing the download timeout lets stuck requests be abandoned quickly, improving crawling efficiency.
Set DOWNLOAD_TIMEOUT in the settings.py configuration file to specify the timeout (in seconds).
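
For example, in settings.py (10 seconds is only an illustrative value):

# Abandon a download if it takes longer than 10 seconds
DOWNLOAD_TIMEOUT = 10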

Extra: settings that control when the spider stops (see the example after the list)

  1. CLOSESPIDER_TIMEOUT Stop the spider after the specified number of seconds
  2. CLOSESPIDER_ITEMCOUNT Stop the spider after the specified number of Items have been scraped
  3. CLOSESPIDER_PAGECOUNT Stop the spider after the specified number of responses have been received
  4. CLOSESPIDER_ERRORCOUNT Stop the spider after the specified number of errors have occurred
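
For example, in settings.py (all values are only illustrative):

# Stop the spider after one hour of crawling
CLOSESPIDER_TIMEOUT = 3600
# Stop after 1000 Items have been scraped
CLOSESPIDER_ITEMCOUNT = 1000
# Stop after 500 responses have been received
CLOSESPIDER_PAGECOUNT = 500
# Stop after 10 errors have occurred
CLOSESPIDER_ERRORCOUNT = 10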

