Scrapy for generic crawlers (Broad Crawls)

  1. Definition
    A broad crawl can cover a very large number of sites (even an unlimited number), constrained only by time or other limits rather than by site boundaries.
  2. Characteristics
    a. It crawls a large number of sites (often unbounded) rather than one specific site.
    b. It does not crawl any site to completion, because doing so would be impractical (or impossible). Instead, it limits the crawl by time or by the number of pages fetched.
    c. Its logic is simpler (compared with a spider that has many complex extraction rules); the data is further post-processed in a later stage.
    d. It crawls many sites in parallel, so it is not held back by the crawl-rate limit of any single site (out of politeness it crawls each site slowly, but it crawls many sites at once).
  3. Recommended settings

    Increase concurrency

    Concurrency is the number of requests processed in parallel. Scrapy has both a global limit and per-site (per-domain or per-IP) limits. The default global concurrency limit is not suitable for crawling a large number of sites, so you will need to increase it. How much depends on how much CPU your crawler can use; 100 is a reasonable starting point, but the best approach is to run tests and find the relationship between the Scrapy process's CPU usage and the number of concurrent requests. For optimal performance, choose a concurrency that keeps CPU usage at around 80-90%.
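    A minimal settings.py sketch under the assumptions above (the value 100 is only the starting point suggested here, not a universal recommendation):

        # settings.py
        CONCURRENT_REQUESTS = 100           # global concurrency limit (Scrapy's default is 16)
        CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-site limit stays low so each single site is still crawled politely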

    Reduce the log level

    In a broad crawl you generally only care about the crawl rate and the errors encountered, and Scrapy reports this information at the INFO log level. To save CPU (and log storage), you should not use the DEBUG log level when running a broad crawl in production; using DEBUG during development, however, is fine.
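    For example, in settings.py (a sketch using Scrapy's LOG_LEVEL setting):

        LOG_LEVEL = 'INFO'  # DEBUG is fine during development, but too noisy and CPU-hungry for a production broad crawl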

    Disable cookies

    Disable cookies unless you actually need them. Cookies are usually not needed for a broad crawl (search engines ignore them), and disabling them saves CPU and the memory Scrapy would use to track them, improving performance.
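    In settings.py (sketch):

        COOKIES_ENABLED = False  # broad crawls rarely need session state, and tracking cookies costs CPU and memory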

    Disable retries

    Retrying failed HTTP requests can slow the crawl down considerably, especially when a site responds slowly (or fails), since each attempt times out and is then retried several times. This is unnecessary, and it ties up crawling capacity that could be spent on other sites.
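    The corresponding settings.py entry (sketch):

        RETRY_ENABLED = False  # give up on failed requests instead of retrying them several times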

    Reduce the download timeout

    Unless you are crawling over a very slow connection (which is usually not the case for a broad crawl), reduce the download timeout so that stuck requests are abandoned quickly, freeing up capacity to handle other sites.
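    A sketch of the setting (15 seconds is an illustrative value, not one given in the original text):

        DOWNLOAD_TIMEOUT = 15  # Scrapy's default is 180 seconds, far too generous for a broad crawl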

    Disable redirects

    Consider disabling redirects unless you are interested in following them. In a broad crawl, a common practice is to save the redirect target and resolve it in a later crawl. This also keeps the number of requests per crawl batch constant; otherwise redirect loops could cause the crawler to spend too many resources on a single site.
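    In settings.py (sketch):

        REDIRECT_ENABLED = False  # unresolved redirect targets can be saved and crawled in a later batch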

    Enable "Ajax Crawlable Pages" crawling

    Some sites (based on empirical data for 2013, as much as 1%) stated as being ajax crawlable. This means that the site offers only pure HTML version of the original ajax acquired data. Web site statement in two ways:
    using a # in the url - This is the default mode;!
    Use a special meta tags - which use "main", "index" page.
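    Scrapy supports this through its AjaxCrawl downloader middleware, which is off by default and can be switched on in settings.py (sketch):

        AJAXCRAWL_ENABLED = True  # handle '#!' ajax-crawlable pages during the broad crawl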

Origin blog.csdn.net/qq_32662595/article/details/85233205