Go Language Self-Study Notes (8)

Disclaimer: This is an original article by the blogger, licensed under the CC 4.0 BY-SA agreement. Please attach the original source link and this statement when reposting.
Original link: https://blog.csdn.net/qq_18800771/article/details/97667637

HTTP programming:

How the Web works: the HTTP protocol

For an ordinary Internet browsing session the process is as follows: the browser itself is a client. When you enter a URL, the browser first asks a DNS server to resolve the domain name and obtain the corresponding IP address. Having located the server by that IP address, it establishes a TCP connection with it. After the browser sends its HTTP request packet, the server receives the request packet, processes it by calling its own services, and returns an HTTP response packet. When the client receives the server's response it begins rendering the body of the packet, and once all of the content has been received, the TCP connection between client and server is closed.

A Web server is also called an HTTP server; it communicates with its clients using the HTTP protocol. The client usually means a Web browser (the client built into a mobile phone app is implemented as a browser internally).

How a Web server works can be summarized simply as follows:

1. The client establishes a TCP connection to the server via the TCP/IP protocol

2. The client sends an HTTP request packet to the server, requesting a resource document from the server

3. The server sends an HTTP response packet to the client. If the requested resource contains content in a dynamic language, the server calls the interpretation engine of that dynamic language to handle the "dynamic content" and sends the resulting data back to the client

4. The client disconnects from the server. The client interprets the HTML document and renders the result graphically on the client's screen

HTTP protocol:

The Hypertext Transfer Protocol (HTTP, HyperText Transfer Protocol) is the most widely used network protocol on the Internet. It lays down detailed rules for communication between browsers and servers, and it is the protocol the World Wide Web uses to transfer documents and data across the Internet.

The HTTP protocol is usually carried over TCP, and sometimes over a TLS or SSL layer; in that case it becomes what we commonly call HTTPS (encrypted).

Address (URL, Uniform Resource Locator): used to identify a network resource; it can be understood as the path of a file on the network.

The URL format is as follows:

http://host[":"port][abs_path]
http://192.168.31.1/html/index    protocol://host:port/path

The length of a URL is limited; the limit differs from server to server, but it is not unlimited.
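
As a quick illustration (not part of the original notes), Go's standard net/url package can split such an address into its components; the sample address below is just the one from the format example above:

package main
import (
	"fmt"
	"net/url"
)
func main() {
	u, err := url.Parse("http://192.168.31.1:8000/html/index")
	if err != nil {
		fmt.Println("url.Parse err = ", err)
		return
	}
	fmt.Println("Scheme = ", u.Scheme) //http
	fmt.Println("Host = ", u.Host)     //192.168.31.1:8000
	fmt.Println("Path = ", u.Path)     ///html/index
}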

Request packet and response packet:

What the client sends to the server is actually a request packet, and what the server sends back to the client is a response packet. (HTTP)

Analyzing the request packet format on an HTTP server: the net package needs to be imported

package main
import (
	"fmt"
	"net"
)
func main() {
	//listen
	listener, err := net.Listen("tcp", ":8000")
	if err != nil {
		fmt.Println("net.Listen err = ", err)
		return
	}
	defer listener.Close()
	//block and wait for a client connection
	conn, err1 := listener.Accept()
	if err1 != nil {
		fmt.Println("listener.Accept err = ", err1)
		return
	}
	defer conn.Close()
	//receive the client's request
	buf := make([]byte, 1024*4)
	n, err2 := conn.Read(buf)
	if n == 0 {
		fmt.Println("conn.Read err = ", err2)
		return
	}
	fmt.Printf("#%v#", string(buf[:n]))
}

When a user visits 127.0.0.1:8000, the server obtains and prints the request packet:

Request message format: an HTTP request message consists of four parts: the request line, the request headers, a blank line, and the request body

#GET / HTTP/1.1		//this line is the request line; the request method is GET
Host: 127.0.0.1:8000	//from here down to the blank line at the end are the request headers
Connection: keep-alive		//anything containing ":" is a key/value pair; every line ends with \r\n
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
	//blank line
#	//request body (empty here)

1. Request Line:

The request line consists of three parts: the method field, the URL field, and the HTTP protocol version field, separated by spaces. The most common HTTP request methods are GET and POST.

The request methods are as follows:

GET: fetch a document from the server
HEAD: fetch only the headers of a document from the server
POST: send data to the server to be processed, commonly used to submit forms
PUT: store the body of the request on the server, i.e. send a document from the client to the server
TRACE: trace a request packet that may pass through proxy servers on its way to the server
OPTIONS: determine which methods can be performed on the server
DELETE: delete a document from the server

GET:

1). When the client reads a resource from the server, it uses the GET method. The GET method requires the server to locate the resource identified by the URL and return it in the data portion of the response packet sent back to the client; in other words, it requests a resource from the server.

2). When the GET method is used, the request parameters and their values are appended to the URL, with a question mark ("?") marking where the URL ends and the request parameters begin. The length of the transferred parameters is limited, so GET is not suitable for uploading data.

3). When a page is fetched with the GET method, the parameters are shown in the browser's address bar, so confidentiality is poor.

POST:

1). When the client has a larger amount of information to give to the server, it can use the POST method. The POST method submits data to the server, for example completing the submission of form data, handing the data to the server for processing.

2). GET is generally used to fetch or query resource information, while POST carries user data and is generally used to update resource information. The POST method wraps the request parameters inside the HTTP request data, and their length is not limited, because the data carried by POST appears as name/value pairs in the HTTP request body, so large amounts of data can be transferred (see the sketch below).
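
A small sketch of the difference (not from the original notes; the address 127.0.0.1:8000/go matches the test server used later, and the parameter names are made up). With GET the parameters ride on the URL, while http.PostForm puts the same parameters into the request body and sets Content-Type to application/x-www-form-urlencoded:

package main
import (
	"fmt"
	"net/http"
	"net/url"
)
func main() {
	params := url.Values{}
	params.Set("name", "mike")
	params.Set("age", "18")
	//GET: parameters are appended to the URL after "?", so they are limited by URL length
	resp, err := http.Get("http://127.0.0.1:8000/go?" + params.Encode())
	if err != nil {
		fmt.Println("http.Get err = ", err)
		return
	}
	resp.Body.Close()
	//POST: the same parameters travel in the request body, with no URL length limit
	resp, err = http.PostForm("http://127.0.0.1:8000/go", params)
	if err != nil {
		fmt.Println("http.PostForm err = ", err)
		return
	}
	resp.Body.Close()
}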

2. Request headers:

The request headers add some additional information to the request message. They consist of "name/value" pairs, one pair per line, with the name and value separated by a colon.

The request headers tell the server information about the client. Classic request headers include Host, Connection, User-Agent, Accept, Accept-Encoding and Accept-Language, all of which can be seen in the captured request above.

3. Blank line:

After the last request header comes a blank line (a carriage return and a line feed are sent), telling the server that there are no more request headers.

4. Request body:

The request body is not used with the GET method; it is used with the POST method.

The POST method is suited to situations where the user has to fill in a form. The headers most commonly used together with the request body are the body type, Content-Type, and the body length, Content-Length.

When a resource is accessed through a URL, e.g. 127.0.0.1:8000/mike.html, the server receives the requested resource path in the request line (#GET /mike.html  HTTP/1.1).

Analyzing the response message format on the client side (assembling the request packet as a string and sending it): the net package needs to be imported

Test server: prints hello world

package main
import(
	"fmt"
	"net/http"
)
//business-logic handler written on the server side
func myHandler (w http.ResponseWriter, r *http.Request){
	fmt.Fprintln(w, "hello world")
}
func main(){
	http.HandleFunc("/go", myHandler)
	//listen on the specified address and start an HTTP service
	http.ListenAndServe("127.0.0.1:8000", nil)
}

Visit in the browser: 127.0.0.1:8000/go

Client code:

package main
import (
	"fmt"
	"net"
)
func main() {
	//actively connect to the server
	conn, err := net.Dial("tcp", ":8000")
	if err != nil {
		fmt.Println("net.Dial err = ", err)
		return
	}
	defer conn.Close()
	requestBuf := "GET /go HTTP/1.1\r\nAccept: image/gif, image/jpeg, image/pjpeg, application/x-ms-application, application/xaml+xml, application/x-ms-xbap, */*\r\nAccept-Language: zh-Hans-CN,zh-Hans;q=0.8,en-US;q=0.5,en;q=0.3\r\nUser-Agent: Mozilla/4.0 (compatoble; MSIE 7.0; Windows NT 10.0; Wow64;Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)\r\nAccept-Encoding: gzip, deflate\r\nHost: 127.0.0.1:8000\r\nConnection: Keep-Alive\r\n\r\n"
	//send the request packet first; only then will the server send back a response packet
	conn.Write([]byte(requestBuf))
	//receive the response packet sent back by the server
	buf := make([]byte, 1024*4)
	n, err1 := conn.Read(buf)
	if n == 0 {
		fmt.Println("conn.Read err = ", err1)
		return
	}
	//print the response message
	fmt.Printf("#%v#", string(buf[:n]))
}

When the server responds successfully, the client prints the response packet shown below; if the server fails to respond, conn.Read returns an error instead.

Response message format: an HTTP response message consists of four parts: the status line, the response headers, a blank line, and the response body

#HTTP/1.1 200 OK    //status line
Date: Tue, 30 Jul 2019 03:34:48 GMT    //response headers
Content-Length: 12
Content-Type: text/plain; charset=utf-8
    //blank line
hello world    //response body
#

1. Status line:

The status line consists of three parts: the HTTP protocol version field, the status code, and the descriptive text for the status code, separated by spaces.

Status code:

The status code is made up of three digits; the first digit indicates the type of response. There are five major classes of status codes: 1xx (informational), 2xx (success), 3xx (redirection), 4xx (client error) and 5xx (server error).

Examples of common status codes: 200 OK, 302 Found, 400 Bad Request, 403 Forbidden, 404 Not Found, 500 Internal Server Error, 503 Service Unavailable.

2. Response headers:

The response headers may include, for example, the Date, Content-Length and Content-Type headers seen in the captured response above, as well as headers such as Server and Set-Cookie.

3. Blank line:

After the last response header comes a blank line (a carriage return and a line feed are sent), telling the client that there are no more response headers.

4. Response body:

The text information the server returns to the client.

Differences between HTTP GET and POST:

1. Submission:

With a GET submission, the request data is appended to the URL (that is, it is placed in the request line of the HTTP header), with ? separating the URL from the data and multiple parameters joined by &, for example: login.action?name=hyddd&password=idontknow&verify=%E4%BD%A0%E5%A5%BD. English letters and digits are sent as they are; a space is converted to +; Chinese and other characters are percent-encoded from their UTF-8 bytes, producing something like %E4%BD%A0%E5%A5%BD, where the XX in each %XX is the hexadecimal value of one byte.

POST submission: the submitted data is placed in the body of the HTTP packet (<request-body>). So data submitted with GET shows up in the address bar, whereas with a POST submission the address bar does not change.

2. Size of the transferred data:

First of all, the HTTP protocol does not limit the size of the transferred data, and the HTTP specification does not limit URL length either. The limits encountered in real development are mainly:

GET: particular browsers and servers limit URL length. For example, IE limits URLs to 2083 bytes (2K+35). Other browsers, such as Netscape and Firefox, have no limit in theory; it depends on what the operating system supports. So for a GET submission the transferred data is constrained by the URL length.

POST: since the values are not passed through the URL, in theory the data size is unrestricted. In practice, however, every web server imposes its own limit on the size of POSTed data; Apache and IIS 6 each have their own configuration for this.

3. Security:

POST is more secure than GET.

Note: the security discussed here is not the same concept as the "safe" mentioned for GET above. There, "safe" merely meant that no data is modified; here it is security in the real sense. For example, when data is submitted with GET, the user name and password appear in plain text in the URL; because (1) the login page may be cached by the browser and (2) other people can look through the browser history, someone else could obtain your account and password.
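
As a small check of the encoding rules above (this snippet is not in the original notes), net/url's QueryEscape shows letters and digits passing through unchanged, a space becoming +, and Chinese characters being percent-encoded byte by byte:

package main
import (
	"fmt"
	"net/url"
)
func main() {
	fmt.Println(url.QueryEscape("hyddd")) //hyddd
	fmt.Println(url.QueryEscape("a b"))   //a+b
	fmt.Println(url.QueryEscape("你好"))   //%E4%BD%A0%E5%A5%BD
}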

HTTP programming:

Go's standard library ships with the net/http package, which covers the concrete behaviour of both HTTP clients and servers. With net/http we can very conveniently write HTTP client or server programs.

HTTP server:

package main
import (
	"net/http"
)
//w: used to reply (send data) to the client
//req: used to read the data sent by the client
func HandConn(w http.ResponseWriter, req *http.Request) {
	w.Write([]byte("hello go")) //reply to the client
}
func main() {
	//register the handler function: when a client connects, the specified handler is called automatically
	http.HandleFunc("/", HandConn)	//with "/" the server is reached at 127.0.0.1:8000; if the quoted path were /mike.html, it would have to be accessed as 127.0.0.1:8000/mike.html
	//bind and listen
	http.ListenAndServe(":8000", nil)
}

An HTTP server obtaining client information:

package main
import (
	"net/http"
	"fmt"
)
//w: used to reply (send data) to the client
//r: used to read the data sent by the client
func HandConn(w http.ResponseWriter, r *http.Request) {
	fmt.Println(r.Method)	//get the client's request method
	fmt.Println(r.URL)	//get the requested URL
	fmt.Println(r.Header)	//get the header information
	fmt.Println(r.Body)	//the body is empty here
	w.Write([]byte("hello go")) //reply to the client
}
func main() {
	//register the handler function: when a client connects, the specified handler is called automatically
	http.HandleFunc("/", HandConn)	//if the quoted path were /mike.html, it would have to be accessed as 127.0.0.1:8000/mike.html
	//bind and listen
	http.ListenAndServe(":8000", nil)
}
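
r.Body above is only an io stream, so printing it does not show the data itself. A hedged sketch (not from the original notes) of actually reading the body, e.g. for a POST request; io.ReadAll requires Go 1.16+, older versions can use ioutil.ReadAll instead:

package main
import (
	"fmt"
	"io"
	"net/http"
)
func HandConn(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body) //drain the request body into a byte slice
	if err != nil {
		fmt.Println("io.ReadAll err = ", err)
		return
	}
	fmt.Println("body = ", string(body))
	w.Write([]byte("hello go")) //reply to the client
}
func main() {
	http.HandleFunc("/", HandConn)
	http.ListenAndServe(":8000", nil)
}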

HTTP client:

package main
import (
	"fmt"
	"net/http"
)
func main() {
	resp, err := http.Get("http://www.baidu.com") //the returned resp is a struct; see studygolang.com/pkgdoc for details. Baidu's page is large, so you can test against http://127.0.0.1:8000 instead
	if err != nil {
		fmt.Println("http.Get err = ", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("Status = ", resp.Status)         //print the status
	fmt.Println("StatusCode = ", resp.StatusCode) //print the status code
	fmt.Println("Header = ", resp.Header)         //print the header information
	fmt.Println("Body = ", resp.Body)             //print the Body (an io stream)
	//read the io Body
	buf := make([]byte, 4*1024)
	var tmp string
	for {
		n, err := resp.Body.Read(buf)
		if n == 0 {
			fmt.Println("Read err = ", err)
			break
		}
		tmp += string(buf[:n])
	}
	fmt.Println("tmp = ", tmp)
}
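
The manual read loop above can also be replaced by a single io.ReadAll call; a minimal sketch (not from the original notes, again assuming Go 1.16+ for io.ReadAll):

package main
import (
	"fmt"
	"io"
	"net/http"
)
func main() {
	resp, err := http.Get("http://127.0.0.1:8000/go")
	if err != nil {
		fmt.Println("http.Get err = ", err)
		return
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body) //read the whole body in one call
	if err != nil {
		fmt.Println("io.ReadAll err = ", err)
		return
	}
	fmt.Println("body = ", string(data))
}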

HTTP crawler:

The four main steps of a crawler:

1. Identify the target: know which site or what range to search for content

2. Crawl: fetch all of the site's content

3. Extract: filter out the data that is of no use

4. Process the data: store and use it in the way you want

Crawler example: Baidu Tieba

Taking the "Go语言" forum as an example:

1. Identify the target

Page 1: http://tieba.baidu.com/f?kw=go%E8%AF%AD%E8%A8%80&ie=utf-8&pn=0

Page 2: http://tieba.baidu.com/f?kw=go%E8%AF%AD%E8%A8%80&ie=utf-8&pn=50

Page 3: http://tieba.baidu.com/f?kw=go%E8%AF%AD%E8%A8%80&ie=utf-8&pn=100

Each time you go down one page, the number at the end increases by 50 (page 1 starts from 0).

(No data filtering yet.)

Single-task crawler:

package main
import (
	"fmt"
	"net/http"
	"os"
	"strconv" //conversion
)
//fetch the page content
func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url)
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()
	//read the page body content
	buf := make([]byte, 1024*4)
	for {
		n, err := resp.Body.Read(buf)
		if n == 0 {
			fmt.Println("resp.Body.Read err = ", err)
			break
		}
		result += string(buf[:n])
	}
	return
}
func DoWork(start, end int) {
	fmt.Printf("Crawling page data from page %d to page %d\n", start, end)

	//identify the target (know which site or what range to search for content)
	//http://tieba.baidu.com/f?kw=go%E8%AF%AD%E8%A8%80&ie=utf-8&pn=0  next page: +50
	for i := start; i <= end; i++ {
		url := "http://tieba.baidu.com/f?kw=go%E8%AF%AD%E8%A8%80&ie=utf-8&pn=" + strconv.Itoa((i-1)*50)
		fmt.Println("url = ", url)
		//crawl (fetch all of the site's content)
		result, err := HttpGet(url)
		if err != nil {
			fmt.Println("HttpGet err = ", err)
			continue
		}
		//write the content to a file
		fileName := strconv.Itoa(i) + ".html"
		f, err1 := os.Create(fileName)
		if err1 != nil {
			fmt.Println("os.Create err = ", err1)
			continue
		}
		f.WriteString(result) //write the content
		f.Close()             //close the file
	}
}
func main() {
	var start, end int
	fmt.Println("Enter the start page (>=1)")
	fmt.Scan(&start)
	fmt.Println("Enter the end page (>= start page)")
	fmt.Scan(&end)
	DoWork(start, end)
}

Concurrent crawler: the time advantage of concurrency shows up strongly

package main
import (
	"fmt"
	"net/http"
	"os"
	"strconv" //conversion
)
//fetch the page content
func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url)
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()
	//read the page body content
	buf := make([]byte, 1024*4)
	for {
		n, _ := resp.Body.Read(buf)
		if n == 0 {
			break
		}
		result += string(buf[:n])
	}
	return
}
//crawl one page
func SpiderPape(i int, page chan<- int) {
	url := "http://tieba.baidu.com/f?kw=go%E8%AF%AD%E8%A8%80&ie=utf-8&pn=" + strconv.Itoa((i-1)*50)
	fmt.Printf("Crawling page %d: %s\n", i, url)
	//crawl (fetch all of the site's content)
	result, err := HttpGet(url)
	if err != nil {
		fmt.Println("HttpGet err = ", err)
		return
	}
	//write the content to a file
	fileName := strconv.Itoa(i) + ".html"
	f, err1 := os.Create(fileName)
	if err1 != nil {
		fmt.Println("os.Create err = ", err1)
		return
	}
	f.WriteString(result) //write the content
	f.Close()             //close the file
	page <- i
}
func DoWork(start, end int) {
	fmt.Printf("Crawling page data from page %d to page %d\n", start, end)
	page := make(chan int)
	//identify the target (know which site or what range to search for content)
	//http://tieba.baidu.com/f?kw=go%E8%AF%AD%E8%A8%80&ie=utf-8&pn=0  next page: +50
	for i := start; i <= end; i++ {
		go SpiderPape(i, page)
	}
	for i := start; i <= end; i++ {
		fmt.Printf("Page %d crawled\n", <-page)
	}
}
func main() {
	var start, end int
	fmt.Println("Enter the start page (>=1)")
	fmt.Scan(&start)
	fmt.Println("Enter the end page (>= start page)")
	fmt.Scan(&end)
	DoWork(start, end)
}
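
The channel above doubles as a "wait until every page is done" mechanism. An alternative sketch (not from the original notes) uses sync.WaitGroup instead; spiderPage here is a hypothetical stand-in for SpiderPape without the channel parameter:

package main
import (
	"fmt"
	"sync"
)
//hypothetical stand-in for SpiderPape without the channel parameter
func spiderPage(i int) {
	fmt.Printf("page %d done\n", i)
}
func main() {
	var wg sync.WaitGroup
	for i := 1; i <= 10; i++ {
		wg.Add(1)
		go func(page int) {
			defer wg.Done() //mark this page as finished
			spiderPage(page)
		}(i)
	}
	wg.Wait() //block until every page goroutine has called Done
}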

Crawler example: jokes (段子)

Taking pengfu.com (捧腹网) as an example:

1. Identify the target:

Page 1: https://www.pengfu.com/xiaohua_1.html

Page 2: https://www.pengfu.com/xiaohua_2.html

Page 3: https://www.pengfu.com/xiaohua_3.html

Each time you go down one page, the number at the end increases by 1.

Pattern on the listing page:

Looking at the page source, we find that every joke's title has the following format:

<h1 class="dp-b"><a href="https://www.pengfu.com/content_1857787_1.html" target="_blank">系统维护的时候</a>

The URL contained in this title markup is the link to each individual joke's detail page.

Each listing page contains ten such title blocks, starting with <h1 class="dp-b"><a href=" and ending with ". In between is each joke's URL link.

Pattern of a joke's URL:

Opening such a link and viewing the page source, we find that the source contains two <h1> tags in total:

Observation shows that the first <h1> is followed by the information we want: the joke's title and the joke's content.

The joke's title starts with <h1> and ends with </h1>, which lets us filter the content; we only take the first such start/end pair.

The joke's content starts with <div class="content-txt pt10"> and ends with <a id="prev" href="; this start/end pair occurs only once.
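
Before the full implementation, here is a minimal sketch (not from the original notes) of how such a pattern extracts the capture group, using the sample title markup quoted above:

package main
import (
	"fmt"
	"regexp"
)
func main() {
	html := `<h1 class="dp-b"><a href="https://www.pengfu.com/content_1857787_1.html" target="_blank">系统维护的时候</a>`
	re := regexp.MustCompile(`<h1 class="dp-b"><a href="(?s:(.*?))"`)
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		fmt.Println(m[1]) //prints the URL captured by the group
	}
}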

Code implementation: crawl the content through the useful links found on the listing page

Step 1 code: crawl the joke links on the listing page

package main
import (
	"fmt"
	"net/http"
	"regexp"
	"strconv"
)
func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url) //send the GET request
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()
	//read the page content
	buf := make([]byte, 1024*4)
	for {
		n, _ := resp.Body.Read(buf)
		if n == 0 {
			break
		}
		result += string(buf[:n]) //accumulate what has been read
	}
	return
}
func SpiderPape(i int) {
	//identify the url to crawl
	//https://www.pengfu.com/xiaohua_1.html
	url := "https://www.pengfu.com/xiaohua_" + strconv.Itoa(i) + ".html"
	fmt.Printf("Crawling page %d: %s\n", i, url)
	//start crawling the links on the listing page
	result, err := HttpGet(url)
	if err != nil {
		fmt.Println("HttpGet err = ", err)
		return
	}
	//fmt.Println(result)
	//extract the url links with a regular expression: they start with <h1 class="dp-b"><a href=" and end with "
	//compile the expression
	re := regexp.MustCompile(`<h1 class="dp-b"><a href="(?s:(.*?))"`)
	if re == nil {
		fmt.Println("regexp.MustCompile err ")
		return
	}
	//extract the key information
	joyUrls := re.FindAllStringSubmatch(result, -1)
	//fmt.Println(joyUrls)
	//take the urls
	//the first value returned by range is the index, the second is the content
	for _, data := range joyUrls {
		fmt.Println(data[1])
	}
}
func DoWork(start, end int) {
	fmt.Printf("Crawling page URLs from page %d to page %d\n", start, end)
	for i := start; i <= end; i++ {
		//call the function that crawls the listing page
		SpiderPape(i)
	}
}
func main() {
	var start, end int
	fmt.Println("Enter the start page (>=1)")
	fmt.Scan(&start)
	fmt.Println("Enter the end page (>= start page)")
	fmt.Scan(&end)
	DoWork(start, end) //worker function
}

Step 2 code: extract and print the title and content

package main
import (
	"fmt"
	"net/http"
	"regexp"
	"strconv"
	"strings"
)
func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url) //send the GET request
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()
	//read the page content
	buf := make([]byte, 1024*4)
	for {
		n, _ := resp.Body.Read(buf)
		if n == 0 {
			break
		}
		result += string(buf[:n]) //accumulate what has been read
	}
	return
}
//crawl each individual joke
func SpiderOneJoy(url string) (title, content string, err error) {
	//start crawling the joke page
	result, err1 := HttpGet(url)
	if err1 != nil {
		//fmt.Println("HttpGet err = ", err1)
		err = err1
		return
	}
	//extract the key information
	//extract the title: it starts with <h1> and ends with </h1>
	re1 := regexp.MustCompile(`<h1>(?s:(.*?))</h1>`)
	if re1 == nil {
		//fmt.Println("regexp.MustCompile err ")
		err = fmt.Errorf("%s", "regexp.MustCompile err")
		return
	}
	//extract the title text
	tmpTitle := re1.FindAllStringSubmatch(result, 1) //only keep the first match
	for _, data := range tmpTitle {
		title = data[1]
		//title = strings.Replace(title, "\r", "", -1)
		//title = strings.Replace(title, "\n", "", -1)
		//title = strings.Replace(title, " ", "", -1)
		title = strings.Replace(title, "\t ", "", -1) //strip interfering characters, replacing them with nothing
		break
	}
	//extract the content: it starts with <div class="content-txt pt10"> and ends with <a id="prev" href="
	re2 := regexp.MustCompile(`<div class="content-txt pt10">(?s:(.*?))<a id="prev" href="`)
	if re2 == nil {
		//fmt.Println("regexp.MustCompile err ")
		err = fmt.Errorf("%s", "regexp.MustCompile err")
		return
	}
	//extract the content text
	tmpContent := re2.FindAllStringSubmatch(result, -1)
	for _, data := range tmpContent {
		content = data[1]
		content = strings.Replace(content, "\t", "", -1)
		content = strings.Replace(content, "\n", "", -1)
		content = strings.Replace(content, "\r", "", -1)
		content = strings.Replace(content, "<br />", "", -1)
		break
	}
	return
}
func SpiderPape(i int) {
	//identify the url to crawl
	//https://www.pengfu.com/xiaohua_1.html
	url := "https://www.pengfu.com/xiaohua_" + strconv.Itoa(i) + ".html"
	fmt.Printf("Crawling page %d: %s\n", i, url)
	//start crawling the links on the listing page
	result, err := HttpGet(url)
	if err != nil {
		fmt.Println("HttpGet err = ", err)
		return
	}
	//fmt.Println(result)
	//extract the url links with a regular expression: they start with <h1 class="dp-b"><a href=" and end with "
	//compile the expression
	re := regexp.MustCompile(`<h1 class="dp-b"><a href="(?s:(.*?))"`)
	if re == nil {
		fmt.Println("regexp.MustCompile err ")
		return
	}
	//extract the key information
	joyUrls := re.FindAllStringSubmatch(result, -1) //keep all matches
	//fmt.Println(joyUrls)
	//take the urls
	//the first value returned by range is the index, the second is the content
	for _, data := range joyUrls {
		//fmt.Println(data[1])
		//crawl each individual joke
		title, content, err := SpiderOneJoy(data[1])
		if err != nil {
			fmt.Println("SpiderOneJoy err = ", err)
			continue
		}
		fmt.Printf("title = #%v#\n", title)
		fmt.Printf("content = #%v#\n", content)
	}
}
func DoWork(start, end int) {
	fmt.Printf("Crawling page URLs from page %d to page %d\n", start, end)
	for i := start; i <= end; i++ {
		//call the function that crawls the listing page
		SpiderPape(i)
	}
}
func main() {
	var start, end int
	fmt.Println("Enter the start page (>=1)")
	fmt.Scan(&start)
	fmt.Println("Enter the end page (>= start page)")
	fmt.Scan(&end)
	DoWork(start, end) //worker function
}

Step 3 code: write the output to files

package main
import (
	"fmt"
	"net/http"
	"os"
	"regexp"
	"strconv"
	"strings"
)
func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url) //send the GET request
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()
	//read the page content
	buf := make([]byte, 1024*4)
	for {
		n, _ := resp.Body.Read(buf)
		if n == 0 {
			break
		}
		result += string(buf[:n]) //accumulate what has been read
	}
	return
}
//crawl each individual joke
func SpiderOneJoy(url string) (title, content string, err error) {
	//start crawling the joke page
	result, err1 := HttpGet(url)
	if err1 != nil {
		//fmt.Println("HttpGet err = ", err1)
		err = err1
		return
	}
	//extract the key information
	//extract the title: it starts with <h1> and ends with </h1>
	re1 := regexp.MustCompile(`<h1>(?s:(.*?))</h1>`)
	if re1 == nil {
		//fmt.Println("regexp.MustCompile err ")
		err = fmt.Errorf("%s", "regexp.MustCompile err")
		return
	}
	//extract the title text
	tmpTitle := re1.FindAllStringSubmatch(result, 1) //only keep the first match
	for _, data := range tmpTitle {
		title = data[1]
		//title = strings.Replace(title, "\r", "", -1)
		//title = strings.Replace(title, "\n", "", -1)
		//title = strings.Replace(title, " ", "", -1)
		title = strings.Replace(title, "\t ", "", -1) //strip interfering characters, replacing them with nothing
		break
	}
	//extract the content: it starts with <div class="content-txt pt10"> and ends with <a id="prev" href="
	re2 := regexp.MustCompile(`<div class="content-txt pt10">(?s:(.*?))<a id="prev" href="`)
	if re2 == nil {
		//fmt.Println("regexp.MustCompile err ")
		err = fmt.Errorf("%s", "regexp.MustCompile err")
		return
	}
	//extract the content text
	tmpContent := re2.FindAllStringSubmatch(result, -1)
	for _, data := range tmpContent {
		content = data[1]
		content = strings.Replace(content, "\t", "", -1)
		content = strings.Replace(content, "\n", "", -1)
		content = strings.Replace(content, "\r", "", -1)
		content = strings.Replace(content, "<br />", "", -1)
		break
	}
	return
}
//write the content to a file
func StoreJoyToFile(i int, fileTitle []string, fileContent []string) {
	//create the file
	f, err := os.Create(strconv.Itoa(i) + ".txt")
	if err != nil {
		fmt.Println("os.Create err = ", err)
		return
	}
	defer f.Close()
	//write the contents
	n := len(fileTitle)
	for i := 0; i < n; i++ {
		//write the title
		f.WriteString(fileTitle[i] + "\n")
		//write the content
		f.WriteString(fileContent[i] + "\n")
		f.WriteString("\n-----------------------------------\n")
	}
}
func SpiderPape(i int) {
	//identify the url to crawl
	//https://www.pengfu.com/xiaohua_1.html
	url := "https://www.pengfu.com/xiaohua_" + strconv.Itoa(i) + ".html"
	fmt.Printf("Crawling page %d: %s\n", i, url)
	//start crawling the links on the listing page
	result, err := HttpGet(url)
	if err != nil {
		fmt.Println("HttpGet err = ", err)
		return
	}
	//fmt.Println(result)
	//extract the url links with a regular expression: they start with <h1 class="dp-b"><a href=" and end with "
	//compile the expression
	re := regexp.MustCompile(`<h1 class="dp-b"><a href="(?s:(.*?))"`)
	if re == nil {
		fmt.Println("regexp.MustCompile err ")
		return
	}
	//extract the key information
	joyUrls := re.FindAllStringSubmatch(result, -1) //keep all matches
	//fmt.Println(joyUrls)
	fileTitle := make([]string, 0)
	fileContent := make([]string, 0)
	//take the urls
	//the first value returned by range is the index, the second is the content
	for _, data := range joyUrls {
		//fmt.Println(data[1])
		//crawl each individual joke
		title, content, err := SpiderOneJoy(data[1])
		if err != nil {
			fmt.Println("SpiderOneJoy err = ", err)
			continue
		}
		//fmt.Printf("title = #%v#", title)
		//fmt.Printf("content = #%v#", content)
		fileTitle = append(fileTitle, title)       //append the title
		fileContent = append(fileContent, content) //append the content
	}
	fmt.Println("fileTitle = ", fileTitle)
	fmt.Println("fileContent = ", fileContent)
	//write the content to a file
	StoreJoyToFile(i, fileTitle, fileContent)
}
func DoWork(start, end int) {
	fmt.Printf("Crawling page URLs from page %d to page %d\n", start, end)
	for i := start; i <= end; i++ {
		//call the function that crawls the listing page
		SpiderPape(i)
	}
}
func main() {
	var start, end int
	fmt.Println("Enter the start page (>=1)")
	fmt.Scan(&start)
	fmt.Println("Enter the end page (>= start page)")
	fmt.Scan(&end)
	DoWork(start, end) //worker function
}

Concurrent crawler: the speed advantage

package main
import (
	"fmt"
	"net/http"
	"os"
	"regexp"
	"strconv"
	"strings"
)
func HttpGet(url string) (result string, err error) {
	resp, err1 := http.Get(url) //send the GET request
	if err1 != nil {
		err = err1
		return
	}
	defer resp.Body.Close()
	//read the page content
	buf := make([]byte, 1024*4)
	for {
		n, _ := resp.Body.Read(buf)
		if n == 0 {
			break
		}
		result += string(buf[:n]) //accumulate what has been read
	}
	return
}
//crawl each individual joke
func SpiderOneJoy(url string) (title, content string, err error) {
	//start crawling the joke page
	result, err1 := HttpGet(url)
	if err1 != nil {
		//fmt.Println("HttpGet err = ", err1)
		err = err1
		return
	}
	//extract the key information
	//extract the title: it starts with <h1> and ends with </h1>
	re1 := regexp.MustCompile(`<h1>(?s:(.*?))</h1>`)
	if re1 == nil {
		//fmt.Println("regexp.MustCompile err ")
		err = fmt.Errorf("%s", "regexp.MustCompile err")
		return
	}
	//extract the title text
	tmpTitle := re1.FindAllStringSubmatch(result, 1) //only keep the first match
	for _, data := range tmpTitle {
		title = data[1]
		//title = strings.Replace(title, "\r", "", -1)
		//title = strings.Replace(title, "\n", "", -1)
		//title = strings.Replace(title, " ", "", -1)
		title = strings.Replace(title, "\t ", "", -1) //strip interfering characters, replacing them with nothing
		break
	}
	//extract the content: it starts with <div class="content-txt pt10"> and ends with <a id="prev" href="
	re2 := regexp.MustCompile(`<div class="content-txt pt10">(?s:(.*?))<a id="prev" href="`)
	if re2 == nil {
		//fmt.Println("regexp.MustCompile err ")
		err = fmt.Errorf("%s", "regexp.MustCompile err")
		return
	}
	//extract the content text
	tmpContent := re2.FindAllStringSubmatch(result, -1)
	for _, data := range tmpContent {
		content = data[1]
		content = strings.Replace(content, "\t", "", -1)
		content = strings.Replace(content, "\n", "", -1)
		content = strings.Replace(content, "\r", "", -1)
		content = strings.Replace(content, "<br />", "", -1)
		break
	}
	return
}
//write the content to a file
func StoreJoyToFile(i int, fileTitle []string, fileContent []string) {
	//create the file
	f, err := os.Create(strconv.Itoa(i) + ".txt")
	if err != nil {
		fmt.Println("os.Create err = ", err)
		return
	}
	defer f.Close()
	//write the contents
	n := len(fileTitle)
	for i := 0; i < n; i++ {
		//write the title
		f.WriteString(fileTitle[i] + "\n")
		//write the content
		f.WriteString(fileContent[i] + "\n")
		f.WriteString("\n-----------------------------------\n")
	}
}
func SpiderPape(i int, page chan int) {
	//identify the url to crawl
	//https://www.pengfu.com/xiaohua_1.html
	url := "https://www.pengfu.com/xiaohua_" + strconv.Itoa(i) + ".html"
	fmt.Printf("Crawling page %d: %s\n", i, url)
	//start crawling the links on the listing page
	result, err := HttpGet(url)
	if err != nil {
		fmt.Println("HttpGet err = ", err)
		return
	}
	//fmt.Println(result)
	//extract the url links with a regular expression: they start with <h1 class="dp-b"><a href=" and end with "
	//compile the expression
	re := regexp.MustCompile(`<h1 class="dp-b"><a href="(?s:(.*?))"`)
	if re == nil {
		fmt.Println("regexp.MustCompile err ")
		return
	}
	//extract the key information
	joyUrls := re.FindAllStringSubmatch(result, -1) //keep all matches
	//fmt.Println(joyUrls)
	fileTitle := make([]string, 0)
	fileContent := make([]string, 0)
	//take the urls
	//the first value returned by range is the index, the second is the content
	for _, data := range joyUrls {
		//fmt.Println(data[1])
		//crawl each individual joke
		title, content, err := SpiderOneJoy(data[1])
		if err != nil {
			fmt.Println("SpiderOneJoy err = ", err)
			continue
		}
		//fmt.Printf("title = #%v#", title)
		//fmt.Printf("content = #%v#", content)
		fileTitle = append(fileTitle, title)       //append the title
		fileContent = append(fileContent, content) //append the content
	}
	//fmt.Println("fileTitle = ", fileTitle)
	//fmt.Println("fileContent = ", fileContent)
	//write the content to a file
	StoreJoyToFile(i, fileTitle, fileContent)
	page <- i	//write the page number into the channel
}
func DoWork(start, end int) {
	fmt.Printf("Crawling page URLs from page %d to page %d\n", start, end)
	page := make(chan int)
	for i := start; i <= end; i++ {
		//call the function that crawls the listing page
		go SpiderPape(i, page)
	}
	for i := start; i <= end; i++ {
		fmt.Printf("Page %d crawled\n", <-page)
	}
}
func main() {
	var start, end int
	fmt.Println("Enter the start page (>=1)")
	fmt.Scan(&start)
	fmt.Println("Enter the end page (>= start page)")
	fmt.Scan(&end)
	DoWork(start, end) //worker function
}

The pengfu.com server can be unstable; if you get a panic: runtime error: invalid memory address or nil pointer dereference, it is most likely caused by network or server problems.

This concludes the Go language fundamentals part of these study notes; the GUI programming part (two more notes) will not be covered for now.
