Learning Scrapy, part 4: CrawlSpider

Recap:

    One: image lazy loading (how to scrape lazily loaded images)

      --- use selenium to scroll to the image's position so that the image gets loaded

      --- analyze the attribute used for lazy loading and read the image URL from it directly
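The second approach can be sketched with the standard library alone. This is a minimal illustration, not the method used on any particular site: the attribute names in LAZY_ATTRS are assumptions (lazy-loading scripts differ from site to site), and the helper extract_image_urls is a hypothetical name introduced here.

```python
from html.parser import HTMLParser

# Attribute names that lazy-loading scripts often use. These are
# assumptions: inspect the target page to find the real attribute.
LAZY_ATTRS = ("data-original", "data-src", "src2")


class LazyImageParser(HTMLParser):
    """Collect image URLs, preferring a lazy-load attribute over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Try the lazy-load attributes first, then fall back to plain src.
        for name in LAZY_ATTRS + ("src",):
            value = attr_map.get(name)
            if value:
                self.urls.append(value)
                break


def extract_image_urls(html):
    """Return one URL per <img> tag found in the HTML string."""
    parser = LazyImageParser()
    parser.feed(html)
    return parser.urls
```

Inside a Scrapy callback the same idea is usually a single selector, for example response.xpath('//img/@data-original').getall(), again assuming that is the attribute the page uses.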

    Two:

      How to improve Scrapy's crawling efficiency

Increase concurrency:
    By default Scrapy allows 16 concurrent requests (the CONCURRENT_REQUESTS setting); this can be raised where appropriate. For example, writing CONCURRENT_REQUESTS = 100 in the settings file allows up to 100 concurrent requests.

Reduce the log level:
    A running Scrapy crawl emits a large amount of log output. To reduce CPU usage, restrict the log output to INFO or ERROR. In the settings file: LOG_LEVEL = 'INFO'

Disable cookies:
    If cookies are not actually needed, disable them during the crawl to reduce CPU usage and improve crawl efficiency. In the settings file: COOKIES_ENABLED = False

Disable retries:
    Re-sending failed HTTP requests (retrying) slows the crawl down, so retries can be disabled. In the settings file: RETRY_ENABLED = False

Reduce the download timeout:
    When crawling very slow links, a shorter download timeout lets stalled requests be abandoned quickly, which improves efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 sets the timeout to 10 s
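Taken together, the five settings above could be collected in the project's settings.py roughly as follows. This is only a sketch using the values quoted in this post; suitable numbers depend on the target site and the available bandwidth, and disabling retries trades completeness for speed.

```python
# settings.py (fragment): values are the ones quoted in this post,
# not recommendations for every site.

# Allow up to 100 requests in flight at once.
CONCURRENT_REQUESTS = 100

# Log only INFO and above ('ERROR' would be quieter still).
LOG_LEVEL = 'INFO'

# Skip cookie handling when the site does not need it.
COOKIES_ENABLED = False

# Do not re-send failed requests.
RETRY_ENABLED = False

# Give up on a download after 10 seconds.
DOWNLOAD_TIMEOUT = 10
```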

 

 

 

 

    Three: CrawlSpider for whole-site crawling

Origin www.cnblogs.com/baili-luoyun/p/10969422.html