Colly: distributed crawling

Distributed crawling

Distributed crawling can be implemented in different ways, depending on the requirements of the crawling task. In most cases it is enough to scale the network communication layer, which is easy to do with proxies and Colly's proxy switchers.

 

Proxy switchers

With a proxy switcher, crawling stays centralized while the HTTP requests are distributed among multiple proxies. Colly supports proxy switching through its SetProxyFunc() member. Any custom function with the signature func(*http.Request) (*url.URL, error) can be passed to SetProxyFunc().

Note: an SSH server can be used as a socks5 proxy with the -D flag.

Colly has a built-in proxy switcher that rotates through a list of proxies on every request.

Usage:

package main

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	if p, err := proxy.RoundRobinProxySwitcher(
		"socks5://127.0.0.1:1337",
		"socks5://127.0.0.1:1338",
		"http://127.0.0.1:8080",
	); err == nil {
		c.SetProxyFunc(p)
	}
	// ...
}

Implementing a custom proxy switcher:

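import (
	"math/rand"
	"net/http"
	"net/url"
)

// The scheme is set explicitly so each entry is a complete proxy URL.
var proxies = []*url.URL{
	{Scheme: "http", Host: "127.0.0.1:8080"},
	{Scheme: "http", Host: "127.0.0.1:8081"},
}

// randomProxySwitcher returns a randomly chosen proxy for each request.
func randomProxySwitcher(_ *http.Request) (*url.URL, error) {
	return proxies[rand.Intn(len(proxies))], nil
}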

// ...
c.SetProxyFunc(randomProxySwitcher)
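
For comparison with the random switcher, here is a minimal sketch of a custom switcher that rotates through the list in order, which is the behaviour described for the built-in switcher. It uses only the standard library. The names roundRobin, newRoundRobin and pick are made up for this illustration and are not part of Colly.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobin holds a fixed proxy list and a counter shared by all requests.
// (Hypothetical type, for illustration only.)
type roundRobin struct {
	proxies []*url.URL
	next    uint32
}

// newRoundRobin parses the proxy addresses and fails on the first invalid one.
func newRoundRobin(addrs ...string) (*roundRobin, error) {
	if len(addrs) == 0 {
		return nil, fmt.Errorf("no proxies given")
	}
	r := &roundRobin{}
	for _, a := range addrs {
		u, err := url.Parse(a)
		if err != nil {
			return nil, err
		}
		r.proxies = append(r.proxies, u)
	}
	return r, nil
}

// pick matches the func(*http.Request) (*url.URL, error) signature.
// The atomic counter keeps the rotation safe under concurrent requests.
func (r *roundRobin) pick(_ *http.Request) (*url.URL, error) {
	n := atomic.AddUint32(&r.next, 1) - 1
	return r.proxies[n%uint32(len(r.proxies))], nil
}

func main() {
	r, err := newRoundRobin("http://127.0.0.1:8080", "socks5://127.0.0.1:1337")
	if err != nil {
		panic(err)
	}
	// Three picks walk the list and wrap around to the first proxy.
	for i := 0; i < 3; i++ {
		u, _ := r.pick(nil)
		fmt.Println(u)
	}
}
```

In a Colly program, r.pick would be passed to c.SetProxyFunc in the same way as randomProxySwitcher above.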

 

Distributed scrapers

To manage independent, distributed scrapers, the best approach is to wrap each scraper in a server. The server can be any kind of service, such as an HTTP server, a TCP server or Google App Engine. Use a custom storage backend to get centralized, persistent handling of cookies and visited URLs.

Note: Colly has built-in support for Google App Engine. If you use Colly in the App Engine standard environment, do not forget to call Collector.Appengine(*http.Request).

An example implementation can be found here.

 

Distributed storage

By default, visited URLs and cookie data are stored in memory. This is convenient for short-lived crawling jobs, but it can be a serious limitation for large-scale or long-running ones.

Colly's default in-memory storage can be replaced with any storage backend that implements the colly/storage.Storage interface. See the existing storage implementations.

Origin www.cnblogs.com/liujie-php/p/11571048.html