Golang Core Programming (9): Scraping CSDN Home Page Articles with net/http and goquery

Copyright notice: if you would like to quote or republish this article, please contact the author. https://blog.csdn.net/pbrlovejava/article/details/84136691


For more articles on Golang core programming, see: Golang Core Programming (0): Table of Contents


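goquery is a third-party Go library commonly used in crawlers. Its main job is to parse HTML documents and let us select and extract the content we need. goquery is Go's counterpart to jQuery: its selectors work very much like jQuery's, so if you have used jQuery you will find it easy to pick up.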

1. Installing goquery

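Installation is well documented elsewhere, but you may run into the following problem:
Error: package golang.org/x/net/websocket: unrecognized import path
The cause is that the golang.org/x/net package is missing locally. The post below describes a fix: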

https://blog.csdn.net/qq_31967569/article/details/81060525

2. Using goquery

Two articles online already explain this clearly, so I won't repeat it here; you can refer to them:

3. Scraping the CSDN home page articles

3.1 Requirements analysis

First, open the CSDN home page and decide which data we want to scrape.
This time I want to scrape four fields for each article on the home page: the title, the URL, the author, and the read count.

3.2 Analyzing the page's HTML

Right-click and view the page source, then copy it into a tool that makes it easier to read. Here I used Notepad++ to analyze the HTML.

  • 1. First, locate the article title

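It quickly becomes clear that the title is wrapped in an a tag, and that every article entry is rendered as a repeated element with the class list_con. The goquery selector for the title can therefore be written like this: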

document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
	// The title text is inside the a tag under div.title h2.
	name1 := selection.Find("div.title h2 a")
	fmt.Printf("name is :%v\n", name1.Text())
})

Running this prints the title of every article, so the first step is done.

  • 2. Locate the article URL
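document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
	// Attr returns the attribute value plus a bool telling whether it exists; the bool is ignored here.
	url, _ := selection.Find(".read_num").Find("a").Attr("href")
	fmt.Printf("url is :%v\n", url)
})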

Here, once the a tag has been selected, the Attr method is used to read the URL from its href attribute.

  • 3. Locate the article author
document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
	// The author name is the text of the a tag inside the element with class name.
	author := selection.Find(".name").Find("a")
	fmt.Printf("author is :%v\n", author.Text())
})


  • 4. Locate the article read count
document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
	// The read count is the text of span.num inside the read_num link.
	numbers := selection.Find(".read_num").Find("a span.num")
	fmt.Printf("numbers is :%v\n", numbers.Text())
})


4. The complete crawler program

package main

import (
	"fmt"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Fetch the CSDN home page.
	response, err := http.Get("https://www.csdn.net/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	// Close the body on every return path, including a parse failure.
	defer response.Body.Close()

	// Parse the response body into an HTML document with goquery.
	document, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}

	// Walk every article entry and print its fields.
	document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
		name1 := selection.Find("div.title h2 a")
		url, _ := selection.Find(".read_num").Find("a").Attr("href")
		author := selection.Find(".name").Find("a")
		numbers := selection.Find(".read_num").Find("a span.num")

		fmt.Printf("index is :%d|name is :%v|author is :%v|numbers is :%v|url is :%v \n", i, strings.TrimSpace(name1.Text()), strings.TrimSpace(author.Text()), numbers.Text(), strings.TrimSpace(url))
	})
}

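The program successfully scrapes the article data from the CSDN home page. From here the results could be written to a file or a database, and goroutines could be used if more performance is needed; interested readers are welcome to dig deeper! Output from the run: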

index is :0|name is :Python爬取抖音APP,竟然只需要十行代码|author is :娇兮心有之|numbers is :2262|url is :https://blog.csdn.net/qq_40925239/article/details/83786958 
index is :1|name is :千万别做老板最不能容忍的三种人 z|author is :这个也很漂亮|numbers is :1712|url is :https://blog.csdn.net/hdfghh/article/details/83955147 
index is :2|name is :程序员晒出小学儿子满分作文《我的爸爸》,真实的让人心疼|author is :taya_a|numbers is :1642|url is :https://blog.csdn.net/taya_a/article/details/83958356 
index is :3|name is :腾讯 阿里 华为的岗位薪资情况概述|author is :小风花|numbers is :2114|url is :https://blog.csdn.net/hdfyhf/article/details/83931804 
index is :4|name is :震惊,20年开发经验的技术总监不会搭建Java开发环境|author is :Java填坑之路|numbers is :3733|url is :https://blog.csdn.net/yelvgou9995/article/details/83961061 
index is :5|name is :在操作系统、芯片领域跌倒的中国程序员,如何崛起?|author is :残留的淡影|numbers is :835|url is :https://blog.csdn.net/weixin_43587861/article/details/83958910 
index is :6|name is :刚写完排序算法,就被开除了…|author is :Java技术栈|numbers is :605|url is :https://blog.csdn.net/youanyyou/article/details/84026290 
index is :7|name is :有个程序员男友是什么感觉?女网友:连约个会都要处理BUG!|author is :不玩代码的一鸣|numbers is :2051|url is :https://blog.csdn.net/weixin_43338842/article/details/83932502 
index is :8|name is :程序员吐槽阿里加班文化上班太累,网友:做程序员这也算高强度?|author is :不玩代码的一鸣|numbers is :1502|url is :https://blog.csdn.net/weixin_43338842/article/details/83932471 
index is :9|name is :sql 存储过程|author is :树叶子hza|numbers is :1712|url is :https://blog.csdn.net/hza419763578/article/details/83961826 
index is :10|name is :【软件设计师】——总结|author is :邢美玲|numbers is :233|url is :https://blog.csdn.net/xml1996/article/details/83959290 
index is :11|name is :虚拟机和Docker的最大区别|author is :JerryWangSAP|numbers is :467|url is :https://blog.csdn.net/i042416/article/details/84034510 
index is :12|name is :快进来看程序员风格的修真小说!|author is :Java填坑之路|numbers is :485|url is :https://blog.csdn.net/yelvgou9995/article/details/84067063 
index is :13|name is :ORACLE/MYSQL数据库的常用SQL命令|author is :SunJW_2017|numbers is :399|url is :https://blog.csdn.net/SunJW_2017/article/details/84023425 
index is :14|name is :ERP工程师的职责是什么|author is :这个也很漂亮|numbers is :432|url is :https://blog.csdn.net/hdfghh/article/details/84059994 
index is :15|name is :springMVC学习心得及手写springMVC简单实现|author is :棒叔叔|numbers is :245|url is :https://blog.csdn.net/qq_41785135/article/details/83781493 
index is :16|name is :自动化运维一体化|author is :Stestack|numbers is :585|url is :https://blog.csdn.net/Stestack/article/details/83963083 
index is :17|name is :#程序员式幽默趣图!从高的职业,现实的残酷!|author is :javam16|numbers is :240|url is :https://blog.csdn.net/javam16/article/details/83957962 
index is :18|name is :Springboot实现用户登录|author is :HOWSUNSHINE|numbers is :421|url is :https://blog.csdn.net/HOWSUNSHINE/article/details/83988456 
index is :19|name is :多台SQLServer数据实时同步|author is :weixin_37691493|numbers is :296|url is :https://blog.csdn.net/weixin_37691493/article/details/83960586 
index is :20|name is :一名年薪百万阿里P8架构师写给Java程序员一些建议(架构师必备)|author is :M阿|numbers is :420|url is :https://blog.csdn.net/yupi1057/article/details/84068697 
index is :21|name is :在 Java 中初始化 List 的五种方法|author is :Java填坑之路|numbers is :310|url is :https://blog.csdn.net/yelvgou9995/article/details/83933095 
index is :22|name is :徐小平 不做人生规划,你离挨饿只有三天|author is :这个也很漂亮|numbers is :274|url is :https://blog.csdn.net/hdfghh/article/details/83955208 
index is :23|name is :Spring 的体系结构|author is :mukes|numbers is :180|url is :https://blog.csdn.net/mukes/article/details/84071658 
index is :24|name is :浅淡XSS跨站脚本攻击的防御方法|author is :白帽梦想家|numbers is :176|url is :https://blog.csdn.net/sdb5858874/article/details/84033195 
index is :25|name is :是否在公司里 老板叫你做什么 就做什么的总结|author is :牛仔裤新的|numbers is :204|url is :https://blog.csdn.net/jgfyyfd/article/details/83935051 
index is :26|name is :长相一般的普通程序员怎么找到喜欢程序员的妹子做女友?|author is :北辰丶|numbers is :175|url is :https://blog.csdn.net/qq_43093708/article/details/83933576 
index is :27|name is :栈的基本函数C++实现|author is :liaolian1|numbers is :154|url is :https://blog.csdn.net/liaolian1/article/details/84074829 
index is :28|name is :Java并发——阻塞队列|author is :Crazy_CMT|numbers is :168|url is :https://blog.csdn.net/qq_38386085/article/details/84035841 


Reposted from blog.csdn.net/pbrlovejava/article/details/84136691