My first crawler: scraping the titles and links of the articles I wrote on CSDN

A note up front: I am new to web scraping. The interpreter is Python 3.9.6 and the editor is PyCharm Community Edition.

1. Import the urllib library and the BeautifulSoup library

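# urllib is part of the Python standard library.
# urlopen opens and reads a remote object fetched over the network.
# It can read HTML files, image files, or any other file stream.
'''
The BeautifulSoup library formats and organizes complex web information by locating HTML tags,
and presents XML structure information as easy-to-use Python objects.
'''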
from urllib.request import urlopen
from bs4 import BeautifulSoup
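As a rough sketch of the fetching step, the snippet below wraps `urlopen` in two small helpers. The names `build_request` and `fetch_html`, the User-Agent string, and the timeout value are my own assumptions and are not part of the original script. Some sites reject urllib's default User-Agent, so sending a browser-like one is a common precaution, though I have not verified whether CSDN requires it.

```python
from urllib.request import Request, urlopen


def build_request(url):
    # Hypothetical helper: attach a browser-like User-Agent header.
    # (Assumption: the original script calls urlopen(url) directly.)
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch_html(url, timeout=10):
    # Hypothetical helper: download the page and return the raw bytes.
    # The bytes can be passed straight to BeautifulSoup for parsing.
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling `fetch_html("https://blog.csdn.net/qq_50932477")` would return the page bytes, which could then go to `BeautifulSoup(..., features='lxml')` as in the rest of this post.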

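2. Looking at the page source of my CSDN profile page (right-click, then "View page source") shows that each article (title + link) sits under the <div class="article-list"> tag -> h4 heading -> a tag (href).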
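To make the nesting concrete, here is a minimal sketch that runs the same kind of selector against a made-up HTML string. The sample markup and the example.com URLs are invented for illustration and only imitate the structure described above; the real CSDN page contains much more. The built-in `html.parser` is used so that lxml is not needed for this demo.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the div -> h4 -> a nesting described above.
sample_html = """
<div class="article-list">
  <div class="article-item">
    <h4><a href="https://example.com/post/1">First post</a></h4>
  </div>
  <div class="article-item">
    <h4><a href="https://example.com/post/2">Second post</a></h4>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The selector reads: every <a> inside an <h4> inside the article-list div.
links = [(a.get_text(strip=True), a.get("href"))
         for a in soup.select('div[class="article-list"] h4 a')]

for title, url in links:
    print(title, url)
```

Note that `div[class="article-list"]` matches only when the class attribute is exactly `article-list`; the selector `div.article-list` would also match a div that carries additional classes.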

3. Use the select() function to locate and iterate over these tags, then write the results to csdn.txt with ordinary file operations.

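# bsObj is the BeautifulSoup object created in the complete code further down
with open('csdn.txt', 'a+', encoding='utf-8') as f:  # append mode; the file is created if it does not exist
    for item in bsObj.select('div[class="article-list"] h4 a'):
        detail_url = item.get('href')  # article link
        detail_txt = item.get_text()  # article type and title
        f.write(str(detail_txt) + str(detail_url) + '\n')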
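The loop above writes the title and the link back to back with nothing between them. As a possible variation, the sketch below opens the file once and puts a tab between the two fields so that the lines are easier to split later. The function name `save_articles`, the tab separator, and the demo data are my own assumptions, not part of the original script.

```python
import os
import tempfile


def save_articles(rows, path):
    # rows: iterable of (title, url) pairs (hypothetical input format).
    # Open the file once in append mode and write one line per article.
    with open(path, "a", encoding="utf-8") as f:
        for title, url in rows:
            f.write(f"{title.strip()}\t{url}\n")


# Usage with made-up data, written into a temporary directory.
demo_rows = [("First post", "https://example.com/post/1"),
             ("Second post", "https://example.com/post/2")]
demo_path = os.path.join(tempfile.mkdtemp(), "csdn.txt")
save_articles(demo_rows, demo_path)
```

In the crawler itself, `rows` could be built from `(item.get_text(), item.get('href'))` for each tag returned by `select()`.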

Complete code:

# select() method

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://blog.csdn.net/qq_50932477?spm=1000.2115.3001.5343")
bsObj = BeautifulSoup(html, features='lxml')
#blockList = bsObj.findAll('div', class_="article-list")
with open('csdn.txt', 'a+', encoding='utf-8') as f:  # open the output file once, in append mode
    for item in bsObj.select('div[class="article-list"] h4 a'):
        detail_url = item.get('href')  # article link
        detail_txt = item.get_text()  # article type and title
        f.write(str(detail_txt) + str(detail_url) + '\n')
        print(detail_txt)
        print(detail_url)

# find().findAll() method

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://blog.csdn.net/qq_50932477?spm=1000.2115.3001.5343")
bsObj = BeautifulSoup(html, features='lxml')
for item in bsObj.find('div', class_="article-list").findAll('a'):
    #if 'href' in item.attrs:
        #print(item.attrs['href'])
    detail_url = item.get('href')
    detail_txt = item.get_text()
    print(detail_txt)  # print the article type and title
    print(detail_url)  # print the link; None if the <a> has no href
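The two versions are not strictly equivalent: the select() version keeps only a tags nested inside an h4, while find().findAll('a') returns every a tag inside the div. The sketch below shows the difference on a made-up HTML string. The markup and URLs are invented for illustration, and `find_all` is the spelling used in the current Beautiful Soup documentation for `findAll`.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two article links inside <h4>, plus one extra link in a <p>.
sample_html = """
<div class="article-list">
  <h4><a href="https://example.com/post/1">First post</a></h4>
  <p><a href="https://example.com/post/1#comments">3 comments</a></p>
  <h4><a href="https://example.com/post/2">Second post</a></h4>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector: only <a> tags that sit inside an <h4>.
via_select = [a.get("href")
              for a in soup.select('div[class="article-list"] h4 a')]

# find() returns None when the div is missing, so guard before find_all().
container = soup.find("div", class_="article-list")
via_find = [a.get("href") for a in container.find_all("a")] if container else []

print(len(via_select), len(via_find))
```

On this sample the select() list has two links and the find_all() list has three, because the comments link is not inside an h4.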
