First, the goquery library:
1. Brief description
According to the official README, goquery brings a syntax and a set of features similar to jQuery to Go. It is built on Go's net/html package (golang.org/x/net/html) and the CSS selector library cascadia. Because the net/html parser returns nodes rather than a full-featured DOM tree, jQuery's stateful manipulation functions (such as height(), css(), and detach()) are left out. The parser also requires the document to be UTF-8 encoded, so callers should convert the source document's encoding when necessary. (Source: goquery README)
2. Main types and methods
2.1 Document: represents the HTML document to be manipulated
// Document represents an HTML document to be manipulated. Unlike jQuery, which
// is loaded as part of a DOM document, and thus acts upon its containing
// document, GoQuery doesn't know which HTML document to act upon. So it needs
// to be told, and that's what the Document class is for. It holds the root
// document node to manipulate, and can make selections on this document.
type Document struct {
	*Selection
	Url      *url.URL
	rootNode *html.Node
}
2.2 Selection: a collection of nodes that match some criteria
// Selection represents a collection of nodes matching some criteria. The
// initial Selection can be created by using Document.Find, and then
// manipulated using the jQuery-like chainable syntax and methods.
type Selection struct {
	Nodes    []*html.Node
	document *Document
	prevSel  *Selection
}
2.3 Selection methods for positional access:
Eq()
Index()
Last()
Slice()
Get()
······
Second, collecting proxy IPs:
1. Proxy IP Pool
2. Collecting with goquery:
Without further ado, here is the code:
import "github.com/PuerkitoBio/goquery"
This imports the goquery library.
// Fields returned for each collected proxy
type proxyResult struct {
	Ip           string `json:"ip"`           // IP address
	Port         int    `json:"port"`         // port
	Agreement    string `json:"agreement"`    // request protocol
	Anonymous    string `json:"anonymous"`    // anonymity level
	Region       string `json:"region"`       // region
	Speed        string `json:"speed"`        // response speed
	Source       string `json:"source"`       // source (the site the proxy was collected from)
	Verification string `json:"verification"` // verification time
}
// Parameters needed for collection: the 1-based table column of each field.
// The fields are unexported, so encoding/json would ignore them; JSON tags are omitted.
type proxyParamet struct {
	ipIndex           int // column of the IP address
	portIndex         int // column of the port
	agreementIndex    int // column of the request protocol
	anonymousIndex    int // column of the anonymity level
	regionIndex       int // column of the region
	speedIndex        int // column of the response speed
	sourceIndex       int // column of the source (the site collected from)
	verificationIndex int // column of the verification time
}
This creates two structs: proxyResult stores the returned data, and proxyParamet carries the incoming parameters.
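A struct tag only affects encoding/json when it is written in the key:"value" form with the value in double quotes, and only exported fields are serialized at all. The sketch below shows both points on a cut-down record; the type rowExample and the function encodeRow are hypothetical names used for illustration, not part of the original project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rowExample is a hypothetical, cut-down version of a proxy record.
type rowExample struct {
	Ip   string `json:"ip"`   // quoted tag value: serialized under the key "ip"
	Port int    `json:"port"` // quoted tag value: serialized under the key "port"
	note string // unexported: encoding/json skips it
}

// encodeRow marshals one record and returns the JSON text.
func encodeRow(r rowExample) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encodeRow(rowExample{Ip: "127.0.0.1", Port: 8080, note: "local"})
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(s) // prints {"ip":"127.0.0.1","port":8080}
}
```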
func CollectionResources(targetUrl string, parame proxyParamet) []proxyResult {
	// Holds the collected results
	proxyList := []proxyResult{}
	// Build the request to the target site (GET here; switch to POST if the site needs it)
	req, err := http.NewRequest(http.MethodGet, targetUrl, nil)
	if err != nil {
		log.Pr("spider", "request failed", err)
		return proxyList
	}
	// Set a random User-Agent on the outgoing request (not on the response)
	req.Header.Set("User-Agent", random.RandomUseragent())
	resp, err := http.DefaultClient.Do(req)
	// Check err before touching resp: resp is nil when the request fails
	if err != nil {
		log.Pr("spider", "request failed", err)
		return proxyList
	}
	// Release the body once collection has finished
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Pr("spider", "request failed", fmt.Errorf("unexpected status %s", resp.Status))
		return proxyList
	}
	// Parse the HTML document returned by the target site
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Pr("spider", "failed to parse document", err)
		return proxyList
	}
	// Returns the trimmed text of the n-th cell in a row.
	// Go has no implicit type conversion, so the index is converted by hand.
	cell := func(row *goquery.Selection, n int) string {
		return strings.TrimSpace(row.Find("td:nth-child(" + strconv.Itoa(n) + ")").Text())
	}
	// Most proxy sites present their list as a table, so scrape the tbody rows directly
	doc.Find("tbody tr").Each(func(i int, selection *goquery.Selection) {
		port, err := strconv.Atoi(cell(selection, parame.portIndex))
		if err != nil {
			log.Pr("spider", "data conversion failed", err)
			return // skip rows without a valid port
		}
		proxyList = append(proxyList, proxyResult{
			Ip:        cell(selection, parame.ipIndex),
			Port:      port,
			Agreement: cell(selection, parame.agreementIndex),
			Anonymous: cell(selection, parame.anonymousIndex),
			Region:    cell(selection, parame.regionIndex),
			// Some proxy sites append the unit "秒" (seconds); strip it if present. Optional.
			Speed:        strings.TrimSuffix(cell(selection, parame.speedIndex), "秒"),
			Source:       targetUrl,
			Verification: cell(selection, parame.verificationIndex),
		})
	})
	return proxyList
}
Calling it:
parame := proxyParamet{
	ipIndex:           1,
	portIndex:         2,
	agreementIndex:    4,
	anonymousIndex:    3,
	regionIndex:       5,
	speedIndex:        6,
	verificationIndex: 7,
}
CollectionResources("https://www.kuaidaili.com/free/", parame)
Each field of parame gives the position of the matching table column, counted from 1 (the numbering that the CSS :nth-child selector uses).
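To make the 1-based column indexes concrete, here is a small sketch of how an index becomes a selector and how raw cell text can be cleaned up before use. cellSelector, parsePort, and trimSpeedUnit are illustrative helpers that assume cells may carry surrounding whitespace and a trailing "秒" unit; they are not taken from the original project.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cellSelector builds the CSS selector for a 1-based table column.
func cellSelector(column int) string {
	return "td:nth-child(" + strconv.Itoa(column) + ")"
}

// parsePort converts the text of a port cell to an int and checks its range.
func parsePort(cellText string) (int, error) {
	port, err := strconv.Atoi(strings.TrimSpace(cellText))
	if err != nil {
		return 0, fmt.Errorf("invalid port %q: %w", cellText, err)
	}
	if port < 1 || port > 65535 {
		return 0, fmt.Errorf("port %d out of range", port)
	}
	return port, nil
}

// trimSpeedUnit removes surrounding whitespace and a trailing "秒" (seconds) unit.
func trimSpeedUnit(cellText string) string {
	return strings.TrimSuffix(strings.TrimSpace(cellText), "秒")
}

func main() {
	fmt.Println(cellSelector(2)) // prints td:nth-child(2)
	port, err := parsePort(" 8080\n")
	fmt.Println(port, err)              // prints 8080 <nil>
	fmt.Println(trimSpeedUnit("0.5秒")) // prints 0.5
}
```

Validating the port at this stage keeps header rows and malformed entries out of the result list.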
Extension: using a proxy
Now that the proxy IPs have been collected, why not route requests through them?
Again, straight to the code:
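func StartRequestProxy(address string, proxyIpInfo []proxyResult) string {
	// Format: protocol + address + port. This is the sample value;
	// replace it with an address built from an entry in proxyIpInfo.
	proxyAddr := "https://127.0.0.1:8080"
	cli := newHttpClient(proxyAddr)
	if cli == nil {
		return "" // the proxy address could not be parsed
	}
	data, err := httpGET(cli, address)
	if err != nil {
		return ""
	}
	return string(data)
}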
func newHttpClient(proxyAddr string) *http.Client {
	proxy, err := url.Parse(proxyAddr)
	if err != nil {
		return nil // callers must check for a nil client
	}
	netTransport := &http.Transport{
		Proxy: http.ProxyURL(proxy),
		// Give up on connection attempts after 10 seconds
		DialContext:           (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
		MaxIdleConnsPerHost:   10,              // maximum idle connections per host
		ResponseHeaderTimeout: 5 * time.Second, // wait at most 5 seconds for the response headers
	}
	return &http.Client{
		Timeout:   10 * time.Second,
		Transport: netTransport,
	}
}
func httpGET(client *http.Client, url string) (body []byte, err error) {
	rsp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()
	// err is always nil here, so only the status code needs checking
	if rsp.StatusCode != http.StatusOK {
		err = fmt.Errorf("HTTP GET Code=%v, URI=%v", rsp.StatusCode, url)
		log.Pr("HttpGet", "Request error", err)
		return nil, err
	}
	// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+)
	return io.ReadAll(rsp.Body)
}
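The proxy address handed to newHttpClient has to be assembled from a collected record. One possible way to do that, assuming the protocol column holds values such as "HTTP" or "HTTPS", is sketched below; buildProxyAddr is a hypothetical helper and not part of the original project.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
	"strings"
)

// buildProxyAddr joins protocol, IP, and port into a proxy URL string.
// It rejects values that could not form a usable HTTP(S) proxy address.
func buildProxyAddr(protocol, ip string, port int) (string, error) {
	scheme := strings.ToLower(strings.TrimSpace(protocol))
	if scheme != "http" && scheme != "https" {
		return "", fmt.Errorf("unsupported protocol %q", protocol)
	}
	if net.ParseIP(ip) == nil {
		return "", fmt.Errorf("invalid IP address %q", ip)
	}
	if port < 1 || port > 65535 {
		return "", fmt.Errorf("port %d out of range", port)
	}
	// JoinHostPort also brackets IPv6 addresses correctly.
	u := url.URL{Scheme: scheme, Host: net.JoinHostPort(ip, strconv.Itoa(port))}
	return u.String(), nil
}

func main() {
	addr, err := buildProxyAddr("HTTPS", "127.0.0.1", 8080)
	if err != nil {
		fmt.Println("skip proxy:", err)
		return
	}
	fmt.Println(addr) // prints https://127.0.0.1:8080
}
```

Building the address through net/url rather than plain string concatenation means a malformed record is rejected before it reaches the HTTP client.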
Complete source code: GitHub / ipProxy
Note: part of the goquery introduction draws on the Go Language Chinese Network.