A killer tool! Collyx, a lightning-fast crawler framework, is open source today!

1. Introduction

Foreword: Colly is a well-known crawler framework implemented in Go, and Go's strengths in high-concurrency and distributed scenarios are exactly what crawler work needs. Colly is lightweight, fast, and elegantly designed, and its distributed support is simple and easy to extend.

GitHub address: github.com/gocolly/colly

Colly official website: http://go-colly.org/

As its GitHub star count shows, colly is very popular in the community. Today we introduce the collyx crawler framework, and I will walk readers through it by sharing its source code.


2. Introduction to the collyx framework

Framework introduction: a configurable, distributed crawler architecture built on the colly framework and a net/http wrapper. Users only need to configure parameters such as the parse function, concurrency, storage topic, request method, and request URL; the rest of the code, much like in Scrapy, does not need to be written separately.

Advantages of the framework: it implements a retry mechanism; every function is pluggable (custom parse modules, struct modules, and so on); and the scheduling module is abstracted away, which greatly reduces code redundancy and speeds up development. Concurrent feed-stream crawling is also supported, so the framework is not limited to depth-first crawls; it can be used breadth-first as well.

Collyx architecture diagram preview:


3. Source code sharing

According to the architecture diagram above, the framework can be divided into six components: spiders, engine, items, downloader, pipelines, and scheduler. Below, we will walk through the collyx source code component by component and also show part of the extensions source code. The full directory layout is as follows:

1. Spiders module: the custom entry code is as follows:

// Package spiders ---------------------------
// @author : TheWeiJun

package main

import (
    "collyx-spider/common"
    "collyx-spider/items/http"
    "collyx-spider/pipelines"
    "collyx-spider/spiders/crawler"
)

func main() {
    request := http.FormRequest{
        Url:         "https://xxxxx",
        Payload:     "xxxxx",
        Method:      "POST",
        RedisKey:    "ExplainingGoodsChan",
        RedisClient: common.LocalRedis,
        RedisMethod: "spop",
        Process:     pipelines.DemoParse,
        Topic:       "test",
    }
    crawler.Crawl(&request)
}

Note: you only need to configure parameters such as the URL, payload, method, Redis, and Kafka settings for the target to be crawled; parameters you do not need can simply be omitted.
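For example, a GET-only configuration could drop the Payload field and let the crawler module substitute the popped task id into a URL template. The URL, Redis key, and template below are hypothetical, a sketch rather than code from the repository:

package main

import (
    "collyx-spider/common"
    "collyx-spider/items/http"
    "collyx-spider/pipelines"
    "collyx-spider/spiders/crawler"
)

func main() {
    // Hypothetical GET-only configuration: no Payload; the task id popped from
    // Redis is substituted into the Url template by the crawler module.
    request := http.FormRequest{
        Url:         "https://example.com/item/%s", // hypothetical URL template
        Method:      "GET",
        RedisKey:    "DemoTaskChan", // hypothetical Redis set
        RedisClient: common.LocalRedis,
        RedisMethod: "spop",
        Process:     pipelines.DemoParse,
        Topic:       "test",
    }
    crawler.Crawl(&request)
}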

2. The source code of the engine module, which configures colly's initialization parameters, is as follows:

package engine

import (
    "collyx-spider/common"
    downloader2 "collyx-spider/downloader"
    extensions2 "collyx-spider/extensions"
    "collyx-spider/items/http"
    "collyx-spider/scheduler"
    "github.com/gocolly/colly"
    "time"
)

var Requests = common.GetDefaultRequests()
var TaskQueue = common.GetDefaultTaskQueue()
var Proxy = common.GetDefaultProxy()
var KeepAlive = common.GetDefaultKeepAlive()
var kafkaStatus = common.GetKafkaDefaultProducer()
var RequestChan = make(chan bool, Requests)
var TaskChan = make(chan interface{}, TaskQueue)

func CollyConnect(request *http.FormRequest) {
    var c = colly.NewCollector(
        colly.Async(true),
        colly.AllowURLRevisit(),
    )
    c.Limit(&colly.LimitRule{
        Parallelism: Requests,
        Delay:       time.Second * 3,
        RandomDelay: time.Second * 5,
    })
    if Proxy {
        extensions2.SetProxy(c, KeepAlive)
    }
    //if kafkaStatus {
    //    common.InitDefaultKafkaProducer()
    //}
    extensions2.URLLengthFilter(c, 10000)
    downloader2.ResponseOnError(c, RequestChan)
    downloader2.DownloadRetry(c, RequestChan)
    request.SetConnect(c)
    request.SetTasks(TaskChan)
    request.SetRequests(RequestChan)
}

func StartRequests(request *http.FormRequest) {
    /*add headers add parse*/
    go scheduler.GetTaskChan(request)
    if request.Headers != nil {
        request.Headers(request.Connect)
    } else {
        extensions2.GetHeaders(request.Connect)
    }
    downloader2.Response(request)
}

Description: this module handles operations such as initializing the scheduler, the request-headers extension, the downloader, and colly itself. It is one of the core modules that keep the framework running.
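The common package itself is not shown in this article, so the GetDefault* helpers above remain a black box here. Purely as a guess at their intent, they might read defaults from configuration along these lines; every name, environment variable, and default value below is an assumption:

package common

import (
    "os"
    "strconv"
)

// Hypothetical sketch only: the real common package is not shown in the article.
// These helpers presumably load defaults from configuration; here they are faked
// with environment variables and hard-coded fallbacks.
func GetDefaultRequests() int {
    if v, err := strconv.Atoi(os.Getenv("COLLYX_REQUESTS")); err == nil {
        return v // concurrency / parallelism limit
    }
    return 16 // assumed default
}

func GetDefaultTaskQueue() int {
    if v, err := strconv.Atoi(os.Getenv("COLLYX_TASK_QUEUE")); err == nil {
        return v // capacity of the task channel
    }
    return 1000 // assumed default
}

func GetDefaultProxy() bool {
    return os.Getenv("COLLYX_PROXY") == "true" // whether to enable the proxy extension
}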

3. The scheduler module, complete source code:

package scheduler

import (
    "collyx-spider/items/http"
    log "github.com/sirupsen/logrus"
    "strings"
    "time"
)

func GetTaskChan(request *http.FormRequest) {
    redisKey := request.RedisKey
    redisClient := request.RedisClient
    redisMethod := request.RedisMethod
    limits := int64(cap(request.TasksChan))
    TaskChan := request.TasksChan
    methodLowerStr := strings.ToLower(redisMethod)
    for {
        switch methodLowerStr {
        case "do":
            result, _ := redisClient.Do("qpop", redisKey, 0, limits).Result()
            searchList := result.([]interface{})
            if len(searchList) == 0 {
                log.Debugf("no task")
                time.Sleep(time.Second * 3)
                continue
            }
            for _, task := range searchList {
                TaskChan <- task
            }
        case "spop":
            searchList, _ := redisClient.SPopN(redisKey, limits).Result()
            if len(searchList) == 0 {
                log.Debugf("no task")
                time.Sleep(time.Second * 3)
                continue
            }
            for _, task := range searchList {
                TaskChan <- task
            }
        default:
            log.Info("Methods are not allowed.....")
        }
        time.Sleep(time.Second)
    }
}

Explanation: here the Redis configuration is read from the spider's request struct pointer, tasks are popped from Redis, and each task is handed to the TaskChan channel for distribution.
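For reference, tasks can be seeded into the Redis set that the scheduler pops from. This is only a sketch, assuming the "spop" configuration from the spider example above and a Redis instance on localhost:

package main

import (
    "fmt"

    "github.com/go-redis/redis"
)

// Sketch: push task ids into the set named by RedisKey ("ExplainingGoodsChan"
// in the spider example); GetTaskChan will SPopN them and feed TaskChan.
func main() {
    client := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // hypothetical address
    for i := 0; i < 100; i++ {
        client.SAdd("ExplainingGoodsChan", fmt.Sprintf("task-%d", i)) // hypothetical task ids
    }
}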

4. The items module source code

4.1 The request_struct.go code is as follows:

package http

import (
    "github.com/go-redis/redis"
    "github.com/gocolly/colly"
)

type FormRequest struct {
    Url          string
    Payload      string
    Method       string
    RedisKey     string
    RedisClient  *redis.Client
    RedisMethod  string
    GetParamFunc func(*FormRequest)
    Connect      *colly.Collector
    Process      func([]byte, string, string) string
    RequestChan  chan bool
    TasksChan    chan interface{}
    Topic        string
    Headers      func(collector *colly.Collector)
    TaskId       string
}

func (request *FormRequest) SetRequests(requests chan bool) {
    request.RequestChan = requests
}

func (request *FormRequest) SetTasks(tasks chan interface{}) {
    request.TasksChan = tasks
}

func (request *FormRequest) SetConnect(conn *colly.Collector) {
    request.Connect = conn
}

func (request *FormRequest) SetUrl(url string) {
    request.Url = url
}

Summary: the request struct is responsible for customizing spider requests and holding the initial request parameters.

4.2 The parse struct: customize the struct according to the content to be parsed and saved (the original screenshot is omitted here).
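Since that screenshot cannot be reproduced, below is a reconstructed sketch of what the parse struct might look like, inferred only from the fields that pipelines.DemoParse reads later; the json tags and exact nesting are assumptions:

package items

// Reconstructed sketch, not the original source: only the fields used by
// pipelines.DemoParse (Promotions, BaseInfo.Title, BaseInfo.PromotionId) are shown.
type Demo struct {
    Promotions []Promotion `json:"promotions"` // hypothetical json tag
}

type Promotion struct {
    BaseInfo BaseInfo `json:"baseInfo"` // hypothetical json tag
}

type BaseInfo struct {
    Title       string `json:"title"`       // hypothetical json tag
    PromotionId string `json:"promotionId"` // hypothetical json tag
}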

5. The downloader module and its directory structure:

Summary: this module covers three functions: download success, download error, and download retry. The source code for each is shared below.

5.1 The download_error.go code is as follows:

package downloader

import (
    "github.com/gocolly/colly"
)

func ResponseOnError(c *colly.Collector, taskLimitChan chan bool) {
    c.OnError(func(r *colly.Response, e error) {
        defer func() {
            <-taskLimitChan
        }()
    })
    c.OnScraped(func(r *colly.Response) {
        defer func() {
            <-taskLimitChan
        }()
    })
}

Module description: this module captures failed requests and promptly releases slots in the concurrency channel.

 

5.2 The download_ok.go code is as follows:

package downloader

import (
    "collyx-spider/common"
    "collyx-spider/items/http"
    "github.com/gocolly/colly"
)

func Response(request *http.FormRequest) {
    c := request.Connect
    c.OnResponse(func(response *colly.Response) {
        defer common.CatchError()
        task := response.Ctx.Get("task")
        isNext := request.Process(response.Body, task, request.Topic)
        if isNext != "" {
            request.RedisClient.SAdd(request.RedisKey, isNext)
        }
    })
}

Module description: this module processes responses with a 200 status code and calls the parse function the spider defined in advance to extract data.
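For readers writing their own parse function, here is a minimal sketch matching the Process signature from FormRequest (func([]byte, string, string) string); the nextId field is hypothetical, and returning a non-empty string causes the downloader to SAdd it back into the Redis task set as a follow-up task:

package pipelines

import (
    "encoding/json"

    log "github.com/sirupsen/logrus"
)

// Sketch of a custom parse function; plug it into FormRequest.Process.
func MyParse(body []byte, task, topic string) string {
    var doc struct {
        NextId string `json:"nextId"` // hypothetical field in the response JSON
    }
    if err := json.Unmarshal(body, &doc); err != nil {
        log.Errorf("parse task %s failed: %v", task, err)
        return ""
    }
    log.Println(task, topic, doc.NextId)
    return doc.NextId // non-empty => fed back to Redis as the next task
}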

5.3 The download_retry.go code is as follows:

package downloader

import (
    "collyx-spider/common"
    "github.com/gocolly/colly"
    "log"
)

func RetryFunc(c *colly.Collector, request *colly.Response, RequestChan chan bool) {
    url := request.Request.URL.String()
    body := request.Request.Body
    method := request.Request.Method
    ctx := request.Request.Ctx
    RequestChan <- true
    c.Request(method, url, body, ctx, nil)
}

func DownloadRetry(c *colly.Collector, RequestChan chan bool) {
    c.OnError(func(request *colly.Response, e error) {
        if common.CheckErrorIsBadNetWork(e.Error()) {
            taskId := request.Request.Ctx.Get("task")
            log.Printf("Start the retry task:%s", taskId)
            RetryFunc(c, request, RequestChan)
        }
    })
}

Module description: the custom error check identifies the error type and triggers the retry mechanism, which makes up for the data that would otherwise be lost when a colly request fails.
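The common.CheckErrorIsBadNetWork helper is not shown in this article; the following is only a guessed sketch of its intent, namely retrying only on transient network errors:

package common

import "strings"

// Guessed sketch: the real implementation is not shown in the article.
// The idea is to match the error message against transient network failures
// so that only those requests are retried.
func CheckErrorIsBadNetWork(errMsg string) bool {
    for _, key := range []string{"timeout", "connection reset", "EOF", "proxy"} {
        if strings.Contains(errMsg, key) {
            return true
        }
    }
    return false
}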

6. The pipelines module, complete code as follows:

package pipelines

import (
    "collyx-spider/common"
    "collyx-spider/items"
    "encoding/json"
    log "github.com/sirupsen/logrus"
)

func DemoParse(bytes []byte, task, topic string) string {
    item := items.Demo{}
    json.Unmarshal(bytes, &item)
    Promotions := item.Promotions
    if Promotions != nil {
        data := Promotions[0].BaseInfo.Title
        proId := Promotions[0].BaseInfo.PromotionId
        common.KafkaDefaultProducer.AsyncSendWithKey(task, topic, data+proId)
        log.Println(data, Promotions[0].BaseInfo.PromotionId, topic)
    } else {
        log.Println(Promotions)
    }
    return ""
}

Module description: within the framework, pipelines are mainly responsible for data parsing and data persistence.
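The Kafka producer used above (common.KafkaDefaultProducer.AsyncSendWithKey) is also not shown in the article. A hypothetical sketch of such a wrapper, built on the sarama library, might look like this:

package common

import "github.com/Shopify/sarama"

// Hypothetical sketch: wraps a sarama async producer and exposes a method with
// the same shape as the AsyncSendWithKey(key, topic, value) call in DemoParse.
type KafkaProducer struct {
    producer sarama.AsyncProducer
}

func NewKafkaProducer(brokers []string) (*KafkaProducer, error) {
    cfg := sarama.NewConfig()
    p, err := sarama.NewAsyncProducer(brokers, cfg)
    if err != nil {
        return nil, err
    }
    return &KafkaProducer{producer: p}, nil
}

func (k *KafkaProducer) AsyncSendWithKey(key, topic, value string) {
    k.producer.Input() <- &sarama.ProducerMessage{
        Topic: topic,
        Key:   sarama.StringEncoder(key),
        Value: sarama.StringEncoder(value),
    }
}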

7. The crawler module, code as follows:

package crawler

import (
    "collyx-spider/common"
    "collyx-spider/engine"
    "collyx-spider/items/http"
    "fmt"
    "github.com/gocolly/colly"
    log "github.com/sirupsen/logrus"
    "strings"
    "time"
)

func MakeRequestFromFunc(request *http.FormRequest) {
    for true {
        select {
        case TaskId := <-request.TasksChan:
            ctx := colly.NewContext()
            ctx.Put("task", TaskId)
            request.TaskId = TaskId.(string)
            if request.Method == "POST" {
                request.GetParamFunc(request)
                if strings.Contains(TaskId.(string), ":") {
                    split := strings.Split(TaskId.(string), ":")
                    TaskId = split[0]
                    data := fmt.Sprintf(request.Payload, TaskId)
                    ctx.Put("data", data)
                    request.Connect.Request(request.Method, request.Url, strings.NewReader(data), ctx, nil)
                }
                request.RequestChan <- true
            } else {
                if strings.Contains(TaskId.(string), "http") {
                    request.Url = TaskId.(string)
                } else {
                    request.GetParamFunc(request)
                }
                request.Connect.Request(request.Method, request.Url, nil, ctx, nil)
                request.RequestChan <- true
            }
        default:
            time.Sleep(time.Second * 3)
            log.Info("TaskChan not has taskId")
        }
    }
}

func MakeRequestFromUrl(request *http.FormRequest) {
    for true {
        select {
        case TaskId := <-request.TasksChan:
            ctx := colly.NewContext()
            ctx.Put("task", TaskId)
            if request.Method == "POST" {
                payload := strings.NewReader(fmt.Sprintf(request.Payload, TaskId))
                request.Connect.Request(request.Method, request.Url, payload, ctx, nil)
            } else {
                fmt.Println(fmt.Sprintf(request.Url, TaskId))
                request.Connect.Request(request.Method, fmt.Sprintf(request.Url, TaskId), nil, ctx, nil)
            }
            request.RequestChan <- true
        default:
            time.Sleep(time.Second * 3)
            log.Info("TaskChan not has taskId.......")
        }
    }
}

func RequestFromUrl(request *http.FormRequest) {
    if request.GetParamFunc != nil {
        MakeRequestFromFunc(request)
    } else {
        MakeRequestFromUrl(request)
    }
}

func Crawl(request *http.FormRequest) {
    /*making requests*/
    engine.CollyConnect(request)
    engine.StartRequests(request)
    go RequestFromUrl(request)
    common.DumpRealTimeInfo(len(request.RequestChan))
}

Summary: the crawler module is mainly responsible for initializing the engine and passing signals to it, driving the whole crawler project.

 

8. The extensions module; its code directory contains AddHeaders.go, AddProxy.go, and URLLengthFilter.go (directory screenshot omitted):

8.1 The source code of AddHeaders.go is as follows:

package extensions

import (
    "fmt"
    "github.com/gocolly/colly"
    "math/rand"
)

var UaGens = []func() string{
    genFirefoxUA,
    genChromeUA,
}

var ffVersions = []float32{58.0, 57.0, 56.0, 52.0, 48.0, 40.0, 35.0}

var chromeVersions = []string{
    "65.0.3325.146",
    "64.0.3282.0",
    "41.0.2228.0",
    "40.0.2214.93",
    "37.0.2062.124",
}

var osStrings = []string{
    "Macintosh; Intel Mac OS X 10_10",
    "Windows NT 10.0",
    "Windows NT 5.1",
    "Windows NT 6.1; WOW64",
    "Windows NT 6.1; Win64; x64",
    "X11; Linux x86_64",
}

func genFirefoxUA() string {
    version := ffVersions[rand.Intn(len(ffVersions))]
    os := osStrings[rand.Intn(len(osStrings))]
    return fmt.Sprintf("Mozilla/5.0 (%s; rv:%.1f) Gecko/20100101 Firefox/%.1f", os, version, version)
}

func genChromeUA() string {
    version := chromeVersions[rand.Intn(len(chromeVersions))]
    os := osStrings[rand.Intn(len(osStrings))]
    return fmt.Sprintf("Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36", os, version)
}

func GetHeaders(c *colly.Collector) {
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", UaGens[rand.Intn(len(UaGens))]())
    })
}

Module description: responsible for rotating a random User-Agent so the spider is less likely to be blocked by the target website.

8.2 The source code of AddProxy.go is as follows:

package extensions

import (
    "collyx-spider/common"
    "github.com/gocolly/colly"
    "github.com/gocolly/colly/proxy"
)

func SetProxy(c *colly.Collector, KeepAlive bool) {
    proxyList := common.RefreshProxies()
    if p, err := proxy.RoundRobinProxySwitcher(
        proxyList...,
    ); err == nil {
        c.SetProxyFunc(p)
    }
}

Module description: this module sets a proxy for each request to reduce errors such as request failures and bans.
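common.RefreshProxies is likewise not included in the article; a hypothetical sketch, assuming it just returns proxy URLs in the form accepted by colly's proxy.RoundRobinProxySwitcher, could be as simple as:

package common

// Hypothetical sketch: the real implementation is not shown. In practice this
// would pull fresh proxies from a pool service or Redis; here it returns fixed
// placeholder addresses.
func RefreshProxies() []string {
    return []string{
        "http://127.0.0.1:8888",   // hypothetical HTTP proxy
        "socks5://127.0.0.1:1080", // hypothetical SOCKS5 proxy
    }
}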

8.3 URLLengthFilter.go source code:

package extensions

import "github.com/gocolly/colly"

func URLLengthFilter(c *colly.Collector, URLLengthLimit int) {
    c.OnRequest(func(r *colly.Request) {
        if len(r.URL.String()) > URLLengthLimit {
            r.Abort()
        }
    })
}

Module description: requests with overly long URLs are discarded; Abort cancels the HTTP request inside the OnRequest callback. That concludes the source code walkthrough. Next, let's run the code and look at the performance of the collyx crawler framework!


4. Framework demo display

1. Start the example code written above; a screenshot of the run is as follows:

Summary: after running the crawler for 5 minutes with a sufficiently large proxy pool, it crawled roughly 2,000 records per minute from the target website. Without exaggeration, this is the fastest crawler framework I have used so far.


5. Experience sharing

That's all for today's sharing; the collyx framework still has a long way to go. I always believe that as long as we keep working hard, we will reach our goals step by step. Finally, thank you for your patience in reading this article!


Origin blog.csdn.net/y1282037271/article/details/129200454