Crawler Configuration
Colly's default configuration is optimized for scraping a small number of sites in a single job. This setup is not ideal if you want to crawl millions of sites. Here are some tweaks:
Use a persistent storage backend
By default, Colly stores cookies and visited URLs in memory. You can replace the built-in in-memory storage backend with any custom storage backend. For details, see the storage backend documentation. A sketch of this follows below.
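As a minimal sketch, one option is the Redis-backed gocolly/redisstorage package; the address, DB number, prefix, and start URL below are placeholder values:

package main

import (
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/redisstorage"
)

func main() {
    c := colly.NewCollector()

    // Placeholder connection settings; adjust for your Redis setup.
    storage := &redisstorage.Storage{
        Address:  "127.0.0.1:6379",
        Password: "",
        DB:       0,
        Prefix:   "job01", // namespaces this crawl's keys
    }

    // Replace the default in-memory backend with the Redis one.
    if err := c.SetStorage(storage); err != nil {
        log.Fatal(err)
    }
    defer storage.Client.Close()

    c.Visit("https://example.com/") // placeholder start URL
}

With this in place, cookies and the visited-URL set survive restarts and can be shared by multiple collector processes.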
Use async for long-running jobs with recursive calls
By default, Colly blocks while a request is in progress, so recursively calling Collector.Visit from callbacks produces a constantly growing stack. Setting collector.Async = true avoids this. (Don't forget to use c.Wait() with async; see the sketch below.)
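A minimal sketch of the async pattern, with example.com as a placeholder start URL:

package main

import "github.com/gocolly/colly"

func main() {
    c := colly.NewCollector(
        colly.Async(true), // requests no longer block the caller
    )

    // Recursive link following: with Async enabled, this queues the
    // request instead of growing the call stack.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.Visit("https://example.com/") // placeholder start URL
    c.Wait()                        // required with Async: wait for all requests
}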
Disable or limit connection keep-alive
Colly uses HTTP keep-alive to increase scraping speed. Keep-alive requires open file descriptors, so long-running jobs can easily reach the max-fd limit.
HTTP keep-alive can be disabled with the following code:
c := colly.NewCollector()
// http.Transport comes from the net/http package.
c.WithTransport(&http.Transport{
    DisableKeepAlives: true,
})
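Alternatively, keep-alive can be limited rather than disabled, using the same transport. A sketch with illustrative values (the specific numbers are assumptions, not recommendations; requires the net/http and time packages):

c := colly.NewCollector()
c.WithTransport(&http.Transport{
    MaxIdleConns:        100,              // illustrative cap on total idle connections
    MaxIdleConnsPerHost: 2,                // illustrative cap on idle connections per host
    IdleConnTimeout:     30 * time.Second, // close idle connections after this interval
})

This keeps some of the speed benefit of connection reuse while bounding the number of file descriptors held open.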