Colly Crawler Configuration

Colly's default configuration is optimized for scraping a small number of sites in a single job. This setup isn't ideal if you'd like to crawl millions of sites. Here are some tweaks:

Use a persistent storage backend

By default, Colly stores cookies and visited URLs in memory. You can replace the built-in in-memory storage backend with any custom backend that implements colly's storage.Storage interface. See the colly/storage package for details.
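
As an illustration, here is a minimal sketch that swaps the in-memory backend for Redis using the gocolly/redisstorage adapter; the address, database, and key prefix below are placeholder values for a local Redis instance, not part of the original article:

package main

import (
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/redisstorage"
)

func main() {
    c := colly.NewCollector()

    // Placeholder connection settings for a local Redis instance.
    storage := &redisstorage.Storage{
        Address:  "127.0.0.1:6379",
        Password: "",
        DB:       0,
        Prefix:   "crawler",
    }

    // Swap the default in-memory backend for Redis; visited URLs
    // and cookies now survive restarts and can be shared by workers.
    if err := c.SetStorage(storage); err != nil {
        log.Fatal(err)
    }
    defer storage.Client.Close()

    c.Visit("http://go-colly.org/")
}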


Use async for long-running jobs with recursive calls

By default, Colly blocks while a request is in progress, so recursively calling Collector.Visit from callbacks produces a constantly growing stack. Setting collector.Async = true avoids this. (Don't forget to call c.Wait() when using async mode.)
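
A minimal sketch of this pattern, using colly's Async collector option; the parallelism limit and starting URL are illustrative additions:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    // Async mode: Visit returns immediately and requests run in
    // background goroutines, so recursive calls don't grow the stack.
    c := colly.NewCollector(colly.Async(true))

    // Optional: cap concurrency so async mode doesn't overwhelm hosts.
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        fmt.Println("found link:", e.Attr("href"))
        // Recursive call: queues the request instead of blocking.
        e.Request.Visit(e.Attr("href"))
    })

    c.Visit("http://go-colly.org/")
    // Wait blocks until all queued requests have finished.
    c.Wait()
}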


Disable or limit connection keep-alive

Colly uses HTTP keep-alive to increase scraping speed. Keep-alive requires open file descriptors, so long-running jobs can easily reach the max-fd limit.

HTTP keep-alive can be disabled with the following code:

c := colly.NewCollector()
// DisableKeepAlives closes the TCP connection after each request;
// http.Transport comes from the standard net/http package.
c.WithTransport(&http.Transport{
    DisableKeepAlives: true,
})
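
If you want to keep some of keep-alive's speed benefit while holding fewer descriptors open, the transport's idle-connection limits can be tuned instead of disabling keep-alive outright. This sketch uses standard net/http Transport fields; the exact limits are arbitrary placeholders, and it additionally needs the time import:

c := colly.NewCollector()
c.WithTransport(&http.Transport{
    MaxIdleConns:        100,              // total idle connections kept open
    MaxIdleConnsPerHost: 2,                // idle connections kept per host
    IdleConnTimeout:     30 * time.Second, // close idle connections promptly
})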

