Golang: using channels and the synchronization wait group (WaitGroup) to build a concurrent crawler

There is a classic saying in Go concurrent programming: don't communicate by sharing memory; share memory by communicating.

The Go language does not encourage using locks to protect shared state accessed by different goroutines (communicating by sharing memory). Instead, it encourages passing shared state, or changes to it, between goroutines through channels (sharing memory by communicating), which, like a lock, guarantees that only one goroutine accesses the shared state at any given time.

Of course, mainstream programming languages provide a basic set of synchronization tools, such as locks, condition variables, and atomic operations, to ensure the safety and consistency of data shared between multiple threads. Unsurprisingly, the Go standard library also provides these synchronization mechanisms, and their usage is similar to that in other languages.
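
For comparison, here is a minimal sketch of the traditional lock-based approach using sync.Mutex from the standard library; the counter type and its names are illustrative only, not part of the original post.

    package main

    import (
        "fmt"
        "sync"
    )

    // A hypothetical shared counter protected by a mutex (the traditional approach).
    type safeCounter struct {
        mu sync.Mutex
        n  int
    }

    func (c *safeCounter) inc() {
        c.mu.Lock()         // only one goroutine may enter at a time
        defer c.mu.Unlock() // release the lock when the function returns
        c.n++
    }

    func main() {
        var wg sync.WaitGroup
        c := &safeCounter{}
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                c.inc()
            }()
        }
        wg.Wait()
        fmt.Println("final count:", c.n) // always 100 thanks to the mutex
    }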

 

 

WaitGroup

WaitGroup is a synchronization wait group.

In terms of type, it is a struct. The purpose of a WaitGroup is to wait for a collection of goroutines to finish executing. The main goroutine calls the Add() method to set the number of goroutines to wait for. Then each goroutine runs and calls the Done() method when it completes. Meanwhile, the Wait() method can be used to block until all goroutines have finished.

Add() method

The Add() method sets the counter value of the WaitGroup. We can think of each WaitGroup as holding a counter that indicates the number of goroutines to be executed in this synchronization wait group.

If the counter value becomes 0, the goroutines blocked in Wait() are released. If the counter value becomes negative, a panic is raised and the program reports an error.
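
A minimal sketch of the negative-counter case described above: calling Done() more times than Add() drives the counter below zero and panics.

    package main

    import "sync"

    func main() {
        var wg sync.WaitGroup
        wg.Add(1)
        wg.Done()
        wg.Done() // counter goes negative: panic: sync: negative WaitGroup counter
    }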

Done() method

The Done() method decrements the counter of the WaitGroup by 1; it is called when a goroutine in the synchronization wait group finishes its execution.

Wait() method

The Wait() method makes the current goroutine wait, entering a blocked state, until the counter of the WaitGroup reaches zero. Only then is the block released and the goroutine can continue to execute.

Sample code

 
    package main

    import (
        "fmt"
        "sync"
    )

    var wg sync.WaitGroup // create the synchronization wait group object

    func main() {
        /*
           WaitGroup: synchronization wait group
               Add() sets the number of child goroutines to be executed in the wait group.

               In main, Wait() keeps the main goroutine in a waiting state until the child
               goroutines in the group have finished; only then is the block released.

               In the function run by each child goroutine, wg.Done() decrements the number
               of child goroutines in the wait group by 1.
        */
        // set the number of goroutines to be executed in the wait group
        wg.Add(2)
        go fun1()
        go fun2()
        fmt.Println("main enters the blocked state... waiting for the child goroutines in wg to finish..")
        wg.Wait() // the main goroutine waits here, i.e. it blocks
        fmt.Println("main, block released..")

    }
    func fun1() {
        for i := 1; i <= 10; i++ {
            fmt.Println("fun1...i:", i)
        }
        wg.Done() // decrement the number of goroutines wg is waiting for by 1, same as Add(-1)
    }
    func fun2() {
        defer wg.Done()
        for j := 1; j <= 10; j++ {
            fmt.Println("\tfun2..j,", j)
        }
    }

Channel

A channel can be thought of as a communication pipe between goroutines. Just as water flows through a pipe from one end to the other, data can be sent into one end of a channel and received at the other.

When we discussed Go's concurrency earlier, we said that although Go also provides the traditional synchronization mechanisms for sharing data between multiple goroutines, the language strongly recommends using channels to implement communication between goroutines.

"Don't communicate by sharing memory; share memory by communicating" is the classic phrase popularized by the golang community.

Receive and send

Sending to and receiving from a channel are blocking by default. When a piece of data is sent to a channel, the sender blocks on the send statement until another goroutine reads the data from the channel. Conversely, when reading data from a channel, the read blocks until some goroutine writes data to the channel.

Sample code: the following code adds Sleep calls so the blocking behaviour of the channel is easier to observe.

 
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        ch1 := make(chan int)
        done := make(chan bool) // channel
        go func() {
            fmt.Println("child goroutine running...")
            time.Sleep(3 * time.Second)
            data := <-ch1 // read data from the channel
            fmt.Println("data:", data)
            done <- true
        }()
        // write data to the channel..
        time.Sleep(5 * time.Second)
        ch1 <- 100

        <-done
        fmt.Println("main..over")

    }

In the program above, we first create two channels: ch1 of type chan int and done of type chan bool. We then start a child goroutine that sleeps for 3 seconds, reads a value from ch1, prints it, and finally sends true on done.
In the main goroutine, we sleep for 5 seconds and then send 100 on ch1; this send blocks until the child goroutine reads the value. The final <-done blocks main until the child goroutine signals that it has finished.

In this way we can implement communication between the child goroutine and the main goroutine through channels. Because the main goroutine blocks reading from done, it cannot exit before the child goroutine has finished, which removes the need for an extra time.Sleep at the end.

In earlier programs we either put the main goroutine to sleep to prevent it from exiting, or used a WaitGroup to make sure the child goroutines finished before the main goroutine ended. With a channel, the main goroutine simply blocks on a receive until the child goroutine signals completion.
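
For comparison, here is a minimal sketch of the same idea using a WaitGroup instead of the done channel; the structure mirrors the example above.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        var wg sync.WaitGroup
        ch1 := make(chan int)

        wg.Add(1)
        go func() {
            defer wg.Done() // signal completion instead of sending on a done channel
            fmt.Println("child goroutine running...")
            time.Sleep(3 * time.Second)
            fmt.Println("data:", <-ch1)
        }()

        ch1 <- 100 // blocks until the child goroutine receives the value
        wg.Wait()  // blocks until the child goroutine calls Done()
        fmt.Println("main..over")
    }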

Deadlock

An important factor to consider when using channels is deadlock. If a goroutine sends data on a channel, it is expected that some other goroutine will receive that data. If this does not happen, the program will deadlock at runtime.

Similarly, if a goroutine is waiting to receive data from a channel, some other goroutine must write data to that channel, otherwise the program will deadlock.

Sample code

 
    package main

    func main() {
        ch := make(chan int)
        ch <- 5
    }

Error:

 
    fatal error: all goroutines are asleep - deadlock!

    goroutine 1 [chan send]:
    main.main()
        /Users/ruby/go/src/l_goroutine/demo08_chan.go:5 +0x50
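
A minimal sketch of two common ways to avoid the deadlock above: either put the send and the receive in different goroutines, or give the channel a buffer so the send does not need a waiting receiver.

    package main

    import "fmt"

    func main() {
        // Option 1: have another goroutine send, while main receives.
        ch := make(chan int)
        go func() {
            ch <- 5 // this send now has a receiver waiting in main
        }()
        fmt.Println("received:", <-ch)

        // Option 2: a buffered channel lets a send succeed without a waiting receiver.
        buffered := make(chan int, 1)
        buffered <- 5
        fmt.Println("buffered value:", <-buffered)
    }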

Goroutine

A goroutine is the entity that actually executes concurrently. Underneath, Go uses coroutines to achieve concurrency. A coroutine is a user-level thread running in user mode, similar to a green thread. Go's runtime chose coroutines because they have the following characteristics:

- Running in user space avoids the cost of switching between kernel mode and user mode.
- They can be scheduled by the language and framework layer.
- Their smaller stack footprint allows a large number of instances to be created, as the sketch below illustrates.
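
A minimal sketch, just to illustrate the low per-goroutine cost mentioned above: it launches 100,000 goroutines, something that would be impractical with OS threads.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        const n = 100000 // far more goroutines than OS threads could reasonably handle

        for i := 0; i < n; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                _ = id * 2 // trivial work; each goroutine starts with only a few KB of stack
            }(i)
        }
        wg.Wait()
        fmt.Println("all", n, "goroutines finished")
    }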

Goroutine scheduler

Go concurrent scheduling: GPM model

On top of the kernel threads provided by the operating system, Go builds its own two-level threading model. The goroutine mechanism implements an M:N threading model and is an implementation of coroutines. Go's built-in scheduler allows each core of a multi-core CPU to execute coroutines.
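
A small sketch of inspecting the scheduler's settings via the standard runtime package: GOMAXPROCS reports (and can set) how many OS threads may execute Go code simultaneously, which by default equals the number of CPU cores.

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        fmt.Println("CPU cores:", runtime.NumCPU())
        // Passing 0 queries the current value without changing it.
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
        fmt.Println("goroutines currently running:", runtime.NumGoroutine())
    }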

The content above comes from https://github.com/rubyhan1314/Golang-100-Days
and mainly explains the basic use of synchronization wait groups and channels, and how Go handles concurrency. For more, you can continue with the material linked above, which comes from the Qianfeng tutorial.

The crawler in practice

Everything above was preparation for this script; without it, jumping straight in would feel too abrupt.
I wrote a crawler script that uses channels for concurrency and a synchronization wait group for Wait().

Let's go straight to the code.

Fetching the HTML

 
    func HttpGet(url string) (result string, err error) {
        resp, err1 := http.Get(url)
        if err1 != nil {
            err = err1
            return
        }
        defer resp.Body.Close()
        // read the body content of the page
        buf := make([]byte, 4*1024)
        for {
            n, err := resp.Body.Read(buf)
            if err != nil {
                if err == io.EOF {
                    break
                } else {
                    fmt.Println("resp.Body.Read err = ", err)
                    break
                }
            }
            result += string(buf[:n])
        }
        return
    }
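
As an aside, the manual read loop above can be replaced by io.ReadAll from the standard library (Go 1.16+; earlier versions use ioutil.ReadAll). A minimal sketch of that alternative, assuming the same "net/http" and "io" imports as the complete program below; the function name HttpGetSimple is mine, not from the original post.

    // HttpGetSimple is a hypothetical simpler variant of HttpGet using io.ReadAll.
    func HttpGetSimple(url string) (string, error) {
        resp, err := http.Get(url)
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body) // reads the whole body until EOF
        if err != nil {
            return "", err
        }
        return string(body), nil
    }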

Crawling each web page and saving it as an .html file

 
    func spiderPage(url string) string {

        fmt.Println("crawling", url)
        // crawl: fetch the whole content of the page
        result, err := HttpGet(url)
        if err != nil {
            fmt.Println(err)
        }
        // write the content to a file
        filename := strconv.Itoa(rand.Int()) + ".html"
        f, err1 := os.Create(filename)
        if err1 != nil {
            fmt.Println(err1)
            return url + " failed to create file"
        }
        // write the content
        f.WriteString(result)
        // close the file
        f.Close()
        return url + " crawled successfully"

    }

With the crawling functions written, we now come to the important part.

Define a worker function

 
    func doWork(start, end int, wg *sync.WaitGroup) {
        fmt.Printf("crawling pages %d to %d\n", start, end)
        // the loop below will very likely finish before the crawling does, so buffered channels are used here
        page := make(chan string, 100)
        results := make(chan string, 100)

        go sendResult(results, start, end)

        go func() {
            for i := 0; i < 20; i++ {
                wg.Add(1)
                go asyn_worker(page, results, wg)
            }
        }()

        for i := start; i <= end; i++ {
            url := "https://tieba.baidu.com/f?kw=%E7%BB%9D%E5%9C%B0%E6%B1%82%E7%94%9F&ie=utf-8&pn=" + strconv.Itoa((i-1)*50)
            page <- url
            println("added " + url + " to page")
        }
        println("closing the channel")
        close(page)

        wg.Wait()
        //time.Sleep(time.Second * 5)
        println("main exiting .....")
    }

Get data from the channel

 
    func asyn_worker(page chan string, results chan string, wg *sync.WaitGroup) {

        defer wg.Done() // defer wg.Done() must be placed inside the concurrently launched function

        for {
            v, ok := <-page // ok becomes false once the channel has been explicitly closed with close() and drained
            if !ok {
                fmt.Println("all data has been read,", ok)
                break
            }
            //fmt.Println("took data:", v, ok)
            results <- spiderPage(v)
        }

        //for n := range page {
        //  results <- spiderPage(n)
        //}
    }

Receiving and printing the crawl results

 
    func sendResult(results chan string, start, end int) {

        //for i := start; i <= end; i++ {
        //  fmt.Println(<-results)
        //}

        // receive and print the crawl results
        for {
            v, ok := <-results
            if !ok {
                fmt.Println("all data has been read,", ok)
                break
            }
            fmt.Println(v)

        }
    }

The general idea is this:

As you can see, I define two channels: one to hold the URLs and the other to hold the crawl results, each with a buffer of 100.
In doWork, sendResult blocks waiting for values from the results channel, and the workers started by the anonymous function block waiting for values from the page channel.

Next, the 200 URLs are written to the page channel. Each worker running asyn_worker takes a URL from page, crawls the page with spiderPage, and sends the result into the results channel.

The sendResult function then takes the values from the results channel and prints them.

As you can see, I start 20 goroutines concurrently in the anonymous function, with the synchronization wait group passed in as a pointer parameter. In theory, the number of concurrent workers can be tuned to the performance of the machine.
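
One detail worth noting: in the code above the results channel is never closed, so sendResult's loop only ends when the program exits. Below is a compact, self-contained sketch (not the original code) of the same pipeline showing one way to close results cleanly: close it after wg.Wait(), and use a done channel so the caller waits until everything has been printed. The page contents are faked so it runs without network access.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        page := make(chan string, 100)
        results := make(chan string, 100)
        done := make(chan struct{})

        // printer: drains results until the channel is closed, then signals done
        go func() {
            for v := range results {
                fmt.Println(v)
            }
            close(done)
        }()

        // workers: 20 goroutines, registered with the WaitGroup before launching
        var wg sync.WaitGroup
        for i := 0; i < 20; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for url := range page {
                    results <- url + " crawled successfully" // spiderPage(url) in the real crawler
                }
            }()
        }

        // producer: feed the URLs, then close page so the workers' range loops end
        for i := 1; i <= 200; i++ {
            page <- fmt.Sprintf("page-%d", i)
        }
        close(page)

        wg.Wait()      // all workers have finished sending into results
        close(results) // now the printer's range loop can terminate
        <-done         // wait until everything has been printed
    }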

main function

 
    func main() {
        start_time := time.Now().UnixNano()

        var wg sync.WaitGroup

        doWork(1, 200, &wg)
        // print the execution time, in milliseconds
        fmt.Printf("execution time: %d ms\n", (time.Now().UnixNano()-start_time)/int64(time.Millisecond))

    }

Run the crawler and measure the running time; the exact time varies from machine to machine, but it should not differ by much.
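
As a side note, a minimal sketch of the more idiomatic way to time a function with the standard library's time.Since, which avoids the manual nanosecond arithmetic; the Sleep here is only a stand-in for the real work.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        start := time.Now()
        // doWork(1, 200, &wg) would go here in the real crawler
        time.Sleep(150 * time.Millisecond)                     // stand-in for the actual work
        fmt.Printf("execution time: %v\n", time.Since(start))  // e.g. 150.1234ms
    }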

Complete code

 
    package main

    import (
        "fmt"
        "io"
        "math/rand"
        "net/http"
        "os"
        "strconv"
        "sync"
        "time"
    )

    func HttpGet(url string) (result string, err error) {
        resp, err1 := http.Get(url)
        if err1 != nil {
            err = err1
            return
        }
        defer resp.Body.Close()
        // read the body content of the page
        buf := make([]byte, 4*1024)
        for {
            n, err := resp.Body.Read(buf)
            if err != nil {
                if err == io.EOF {
                    break
                } else {
                    fmt.Println("resp.Body.Read err = ", err)
                    break
                }
            }
            result += string(buf[:n])
        }
        return
    }

    // crawl a single page
    func spiderPage(url string) string {

        fmt.Println("crawling", url)
        // crawl: fetch the whole content of the page
        result, err := HttpGet(url)
        if err != nil {
            fmt.Println(err)
        }
        // write the content to a file
        filename := strconv.Itoa(rand.Int()) + ".html"
        f, err1 := os.Create(filename)
        if err1 != nil {
            fmt.Println(err1)
            return url + " failed to create file"
        }
        // write the content
        f.WriteString(result)
        // close the file
        f.Close()
        return url + " crawled successfully"

    }

    func asyn_worker(page chan string, results chan string, wg *sync.WaitGroup) {

        defer wg.Done() // defer wg.Done() must be placed inside the concurrently launched function

        for {
            v, ok := <-page // ok becomes false once the channel has been explicitly closed with close() and drained
            if !ok {
                fmt.Println("all data has been read,", ok)
                break
            }
            //fmt.Println("took data:", v, ok)
            results <- spiderPage(v)
        }

        //for n := range page {
        //  results <- spiderPage(n)
        //}
    }

    func doWork(start, end int, wg *sync.WaitGroup) {
        fmt.Printf("crawling pages %d to %d\n", start, end)
        // the loop below will very likely finish before the crawling does, so buffered channels are used here
        page := make(chan string, 100)
        results := make(chan string, 100)

        go sendResult(results, start, end)

        go func() {
            for i := 0; i < 20; i++ {
                wg.Add(1)
                go asyn_worker(page, results, wg)
            }
        }()

        for i := start; i <= end; i++ {
            url := "https://tieba.baidu.com/f?kw=%E7%BB%9D%E5%9C%B0%E6%B1%82%E7%94%9F&ie=utf-8&pn=" + strconv.Itoa((i-1)*50)
            page <- url
            println("added " + url + " to page")
        }
        println("closing the channel")
        close(page)

        wg.Wait()
        //time.Sleep(time.Second * 5)
        println("main exiting .....")
    }

    func sendResult(results chan string, start, end int) {

        //for i := start; i <= end; i++ {
        //  fmt.Println(<-results)
        //}

        // receive and print the crawl results
        for {
            v, ok := <-results
            if !ok {
                fmt.Println("all data has been read,", ok)
                break
            }
            fmt.Println(v)

        }
    }

    func main() {
        start_time := time.Now().UnixNano()

        var wg sync.WaitGroup

        doWork(1, 200, &wg)
        // print the execution time, in milliseconds
        fmt.Printf("execution time: %d ms\n", (time.Now().UnixNano()-start_time)/int64(time.Millisecond))

    }

In general, this script is meant to clarify Go's concurrency model, channels, and the basic use of the synchronization wait group. You could also get by with just Go's locks; either way, the purpose is to prevent safety problems around critical shared resources.

With channels and goroutines, concurrent programming in Go becomes remarkably easy and safe, letting programmers focus their attention on the business logic and improving development efficiency.

Original article: https://gzky.live/article/Golang%E9%80%9A%E9%81%93%E5%90%8C%E6%AD%A5%E7%AD%89%E5%BE%85%E7%BB%84%20%E5%B9%B6%E5%8F%91%E7%88%AC%E8%99%AB


Origin: blog.csdn.net/bianlitongcn/article/details/105367100