A hands-on web crawler project in Go

I originally built my crawler system in Python, but eventually found its efficiency too low. I happened to meet a Go expert who suggested I try Go. The results were good, so I put together these notes!

Colly and Goquery: crawler frameworks in Go

Python crawling typically relies on libraries such as requests, urllib, and scrapy, with parsing handled by BeautifulSoup, pyquery, and lxml. Go's crawler ecosystem is comparatively robust; Colly and Goquery in particular are powerful tools with excellent flexibility and expressiveness.

Web Crawler

What is a web crawler? Essentially, a web crawler inspects the HTML content of web pages and performs actions based on that content. Usually it extracts the links exposed on a page and queues them up to crawl next. We can also extract and save data from the current page; for example, starting from a Wikipedia page, we might save the text and title of the page.

Simple algorithm for crawlers

initialize Queue
enqueue SeedURL

while Queue is not empty:
    URL = Pop element from Queue
    Page = Visit(URL)
    Links = ExtractLinks(Page)
    Enqueue Links on Queue

The Visit and ExtractLinks functions are the points that vary; what each does is specific to your application. Your crawler might try to map the entire web, just like Google, or stick to something as simple as Wikipedia (see the sketch below).
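As a minimal Go sketch of that loop (Visit and ExtractLinks are hypothetical stubs you would implement for your own application):

package main

// Hypothetical stubs; a real crawler would fetch the page and parse out its links.
func Visit(url string) string           { return "" }
func ExtractLinks(page string) []string { return nil }

func main() {
    // Initialize the queue with a seed URL
    queue := []string{"https://en.wikipedia.org"}
    visited := map[string]bool{}
    for len(queue) > 0 {
        // Pop the next URL off the queue
        url := queue[0]
        queue = queue[1:]
        if visited[url] {
            continue // skip pages we have already crawled
        }
        visited[url] = true
        page := Visit(url)
        // Enqueue the links extracted from the page
        queue = append(queue, ExtractLinks(page)...)
    }
}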

As your use case grows, things get more complicated: when many, many pages need to be crawled, you may need a more sophisticated crawler that runs in parallel, and for more complex pages you need a more powerful HTML parser.

Colly

Colly is a flexible crawler framework written in Go. Out of the box it gives you rate limiting, parallel crawling, and more.
One of Colly's basic components is the Collector. The Collector keeps track of the pages that need to be crawled and invokes callbacks as each page is processed.

One, start

It's easy to create a Collector, and there are many options we can use:

c := colly.NewCollector(
    // Restrict crawling to specific domains
    colly.AllowedDomains("godoc.org"),
    // Allow visiting the same page multiple times
    colly.AllowURLRevisit(),
    // Allow crawling to be done in parallel / async
    colly.Async(true),
)

You can also start with a bare colly.NewCollector() and add those options yourself.
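If I read Colly's API correctly, these options simply set exported fields on the Collector, so a sketch like this should be equivalent:

c := colly.NewCollector()
// Same configuration via the Collector's exported fields
c.AllowedDomains = []string{"godoc.org"}
c.AllowURLRevisit = true
c.Async = true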

We can also add restrictions so that our crawler behaves like a well-behaved Internet citizen. Colly makes it simple to add a rate limit:

c.Limit(&colly.LimitRule{
    // Filter domains affected by this rule
    DomainGlob: "godoc.org/*",
    // Set a delay between requests to these domains
    Delay: 1 * time.Second,
    // Add an additional random delay
    RandomDelay: 1 * time.Second,
})


Some websites are picky about high-traffic visits and will disconnect you. Usually, setting a delay of a few seconds keeps you a little further from the naughty list.

From here, we can kick off our collector with a seed URL:

c.Visit("https://godoc.org")

Two, OnHTML

We have a good collector that can start from any website. Now, if we want our collector to actually do something, it needs to inspect pages so it can extract links and other data.
The colly.Collector.OnHTML method allows you to register a callback that runs when the collector reaches a part of a page matching a specific HTML tag specifier. To start with, we can get a callback whenever our crawler sees an <a> tag that contains an href link.

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    // Extract the link from the anchor HTML element
    link := e.Attr("href")
    // Tell the collector to visit the link
    c.Visit(e.Request.AbsoluteURL(link))
})

As seen above, in this callback you get a colly.HTMLElement containing the matched HTML data.
Now we have the beginnings of an actual web crawler: we find the links on a page and tell our collector to visit those links in subsequent requests.
OnHTML is a powerful tool. It can search by CSS selector (e.g. div.my_fancy_class or #someElementId), and you can attach multiple OnHTML callbacks to one collector to handle different types of pages (a small sketch follows the title example below).
Colly's HTMLElement struct is very useful. Besides fetching attributes with the Attr function, you can also extract text. For example, we may want to print the page's title:

c.OnHTML("title", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})
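Since multiple OnHTML callbacks can be attached, here is a small sketch using the hypothetical selectors mentioned above:

// Each callback handles a different kind of page element
c.OnHTML("div.my_fancy_class", func(e *colly.HTMLElement) {
    fmt.Println("Fancy div:", e.Text)
})
c.OnHTML("#someElementId", func(e *colly.HTMLElement) {
    fmt.Println("Element by id:", e.Text)
})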

Three, OnRequest / OnResponse

Sometimes you don't need a specific HTML element from a page, but rather want to know when your crawler is about to retrieve, or has just retrieved, a page. For this, Colly exposes the OnRequest and OnResponse callbacks.
All of these callbacks are invoked for every page that is visited, and they fire alongside OnHTML in a fixed order: 1. OnRequest 2. OnResponse 3. OnHTML 4. OnScraped (not covered in this article, but it may be useful to you). A small sketch below makes this ordering visible.
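(A light assumption here: per Colly's docs, OnScraped receives a *colly.Response just like OnResponse does.)

c.OnRequest(func(r *colly.Request) { fmt.Println("1. request:", r.URL) })
c.OnResponse(func(r *colly.Response) { fmt.Println("2. response:", r.Request.URL) })
c.OnHTML("html", func(e *colly.HTMLElement) { fmt.Println("3. html") })
c.OnScraped(func(r *colly.Response) { fmt.Println("4. scraped:", r.Request.URL) })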
Particularly useful is the ability to abort a request inside the OnRequest callback, which can come in handy when you want your collector to stop:

c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
    // Call r.Abort() here to cancel a request before it is made
})

In OnResponse, you have access to the entire HTML document, which can be useful in certain situations:

c.OnResponse(func(r *colly.Response) {
    // r.Body holds the raw page bytes
    fmt.Println(r.Body)
})

Four, HTMLElement

Besides colly.HTMLElement's Attr() method and Text field, we can also use it to traverse child elements. The ChildText() and ChildAttr() methods, and especially ForEach(), are very useful.
For example, we can use ChildText() to get the text of all paragraphs in a section:

c.OnHTML("#myCoolSection", func(e *colly.HTMLElement) {
    fmt.Println(e.ChildText("p"))
})

We can use ForEach() to loop over the children that match a specific element selector:

c.OnHTML("#myCoolSection", func(e *colly.HTMLElement) {
    e.ForEach("p", func(_ int, elem *colly.HTMLElement) {
        if strings.Contains(elem.Text, "golang") {
            fmt.Println(elem.Text)
        }
    })
})
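ChildAttr() works the same way for attributes. A small sketch (the img child under #myCoolSection is a made-up assumption):

c.OnHTML("#myCoolSection", func(e *colly.HTMLElement) {
    // ChildAttr returns the given attribute of the first matching child
    fmt.Println(e.ChildAttr("img", "src"))
})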

Five, Bringing in Goquery

Colly's built-in HTMLElement is useful for most scraping tasks, but if we want to do particularly advanced DOM traversal we have to look elsewhere. For example, there is (currently) no way to traverse up the DOM to a parent element, or to move sideways through sibling elements.
Enter Goquery, "like that j-thing, only in Go". It's basically jQuery. In Go. (Which is great.) Anything you want to scrape out of an HTML document can be done with Goquery.
Although Goquery is modeled on jQuery, I found it quite similar to the BeautifulSoup API in many ways. So if you come from the Python scraping world, you'll probably feel comfortable with Goquery.
Goquery lets us do more complex HTML selection and DOM traversal than Colly's HTMLElement offers. For example, we might want to find the sibling elements of our anchor element to get some context about a link we've scraped:

dom, _ := goquery.NewDocumentFromReader(strings.NewReader(htmlData))
dom.Find("a").Siblings().Each(func(i int, s *goquery.Selection) {
    fmt.Printf("%d, Sibling text: %s\n", i, s.Text())
})

In addition, we can easily find the parent of a selected element. This can be useful if we're given an anchor tag from Colly and we want to find the page's <title>:

anchor.ParentsUntil("~").Find("title").Text()

ParentsUntil traverses up the DOM until it finds something matching the passed selector. We can use ~ to traverse all the way to the top of the DOM, which then lets us easily grab the title tag.
This really only scratches the surface of what Goquery can do. So far we've seen examples of DOM traversal, but Goquery also has strong support for DOM manipulation: editing text, adding/removing classes or attributes, inserting/removing HTML elements, and so on.
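For instance, here is a minimal runnable sketch of that manipulation API (the HTML snippet is made up; SetText and AddClass are goquery Selection methods):

package main

import (
    "fmt"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<div><h1>Old title</h1><a href="/x">a link</a></div>`
    dom, _ := goquery.NewDocumentFromReader(strings.NewReader(html))

    dom.Find("h1").SetText("New title") // edit text
    dom.Find("a").AddClass("visited")   // add a class

    out, _ := dom.Find("div").Html()
    fmt.Println(out) // prints the edited markup
}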
Bringing it back to web scraping: how do we use Goquery together with Colly? It's simple: every Colly HTMLElement contains a Goquery selection, which you can access through the DOM property.

c.OnHTML("div", func(e *colly.HTMLElement) {
    // Goquery selection of the HTMLElement is in e.DOM
    goquerySelection := e.DOM

    // Example Goquery usage
    fmt.Println(goquerySelection.Find("span").Children().Text())
})

It's worth noting that most scraping tasks can be structured so that Goquery isn't needed at all! Simply add an OnHTML callback for the html element, and you can access the entire page that way. Still, I find Goquery a nice addition to my DOM-traversal tool belt.
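For completeness, a sketch of that Goquery-free approach:

// One callback on the root element gives access to the whole page
c.OnHTML("html", func(e *colly.HTMLElement) {
    fmt.Println(e.ChildText("title"))
})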

Practical projects

1. Album review ranking info from metalsucks

  • Code

// go get github.com/PuerkitoBio/goquery
// git clone https://github.com/golang/net

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Request the HTML page
    res, err := http.Get("http://metalsucks.net")
    if err != nil {
        // Error handling
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != 200 {
        log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
    }
    // Load the HTML document
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    // Find the review items
    doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
        // For each item found, get the band and title
        band := s.Find("a").Text()
        title := s.Find("i").Text()
        fmt.Printf("Review %d: %s - %s\n", i, band, title)
    })
}

  • Output

    Review 0: Darkthrone - Old Star
    Review 1: Baroness - Gold & Grey
    Review 2: Death Angel - Humanicide
    Review 3: Devin Townsend - Empath
    Review 4: Whitechapel - The Valley
    

2. Emoji scraping from emojipedia (colly + goquery)

  • Code

package main

import (
    "fmt"
    "strings"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("emojipedia.org"),
    )

    // Callback for when a scraped page contains an article element
    c.OnHTML("article", func(e *colly.HTMLElement) {
        isEmojiPage := false

        // Extract meta tags from the document
        metaTags := e.DOM.ParentsUntil("~").Find("meta")
        metaTags.Each(func(_ int, s *goquery.Selection) {
            // Search for og:type meta tags
            property, _ := s.Attr("property")
            if strings.EqualFold(property, "og:type") {
                content, _ := s.Attr("content")

                // Emoji pages have "article" as their og:type
                isEmojiPage = strings.EqualFold(content, "article")
            }
        })

        if isEmojiPage {
            // Find the emoji page title
            fmt.Println("Emoji: ", e.DOM.Find("h1").Text())
            // Grab all the text from the emoji's description
            fmt.Println(
                "Description: ",
                e.DOM.Find(".description").Find("p").Text())
        }
    })

    // Callback for links on scraped pages
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // Extract the linked URL from the anchor tag
        link := e.Attr("href")
        // Have our crawler visit the linked URL
        c.Visit(e.Request.AbsoluteURL(link))
    })

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        RandomDelay: 1 * time.Second,
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    c.Visit("https://emojipedia.org")
}

  • Result

3. Image scraping from xiaohuar.com

  • Code

 

// Knowledge points:
// 1. Using net/http; the format and encoding of the returned data
// 2. Regular expressions
// 3. File reading and writing
package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "path/filepath"
    "regexp"
    "strings"
    "sync"
    "time"

    "github.com/axgle/mahonia"
)

var workResultLock sync.WaitGroup

func check(e error) {
    if e != nil {
        panic(e)
    }
}

func ConvertToString(src string, srcCode string, tagCode string) string {
    srcCoder := mahonia.NewDecoder(srcCode)
    srcResult := srcCoder.ConvertString(src)
    tagCoder := mahonia.NewDecoder(tagCode)
    _, cdata, _ := tagCoder.Translate([]byte(srcResult), true)
    result := string(cdata)
    return result
}

func download_img(request_url string, name string, dir_path string) {
    image, err := http.Get(request_url)
    check(err)
    image_byte, err := ioutil.ReadAll(image.Body)
    defer image.Body.Close()
    file_path := filepath.Join(dir_path, name+".jpg")
    err = ioutil.WriteFile(file_path, image_byte, 0644)
    check(err)
    fmt.Println(request_url + "\tdownloaded successfully")
}

func spider(i int, dir_path string) {
    defer workResultLock.Done()
    url := fmt.Sprintf("http://www.xiaohuar.com/list-1-%d.html", i)
    response, err2 := http.Get(url)
    check(err2)
    content, err3 := ioutil.ReadAll(response.Body)
    check(err3)
    defer response.Body.Close()
    html := string(content)
    html = ConvertToString(html, "gbk", "utf-8")
    // fmt.Println(html)
    match := regexp.MustCompile(`<img width="210".*alt="(.*?)".*src="(.*?)" />`)
    matched_str := match.FindAllString(html, -1)
    for _, match_str := range matched_str {
        var img_url string
        name := match.FindStringSubmatch(match_str)[1]
        src := match.FindStringSubmatch(match_str)[2]
        if !strings.HasPrefix(src, "http") {
            var buffer bytes.Buffer
            buffer.WriteString("http://www.xiaohuar.com")
            buffer.WriteString(src)
            img_url = buffer.String()
        } else {
            img_url = src
        }
        download_img(img_url, name, dir_path)
    }
}

func main() {
    start := time.Now()
    dir := filepath.Dir(os.Args[0])
    dir_path := filepath.Join(dir, "images")
    err1 := os.MkdirAll(dir_path, os.ModePerm)
    check(err1)
    for i := 0; i < 4; i++ {
        workResultLock.Add(1)
        go spider(i, dir_path)
    }
    workResultLock.Wait()
    fmt.Println(time.Now().Sub(start))
}

  • Result

  • Downloaded images

Author: Zhang Yafei
Source: https://www.cnblogs.com/zhangyafei
gitee: https://gitee.com/zhangyafeii
Copyright of this article is shared by the author and cnblogs. Reposting is welcome, but without the author's consent this statement must be retained and a clear link to the original article must be given on the page.

Tags: Go之路 (The Road to Go)

Origin: blog.csdn.net/hsu282/article/details/110951227