Python: Scraping a CSDN Blog with BeautifulSoup (1)

I've recently started learning Python, and our group's shared blog hasn't been updated in a while,
so I figured I'd use Python to give it a try.

Open the RSS feed from the link in the top-right corner of the blog page.
(screenshot: the RSS subscription link)
The feed contains many of your earlier posts, but not all of them: I have roughly 40 posts, and the
feed only carries a dozen or so.
(screenshot: the RSS feed contents)

Next, feed the RSS URL to a Python script and pull out the tags of the form
<a href="https://blog.csdn.net/adlatereturn/article/details/108889759">原文链接</a>
This isn't hard, but note that you need BeautifulSoup(html, 'lxml'): what the feed actually returns is XML, even though viewing its source in the browser makes it look like HTML, which is quite a trap. (There is a short parser-choice sketch after the script below.)

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup


def getAritcles(articleUrl):
    # Fetch the RSS feed; give up on network or HTTP errors
    try:
        html = urlopen(articleUrl)
    except HTTPError as e:
        print(e)
        return None
    except URLError as e:
        print(e)
        return None
    try:
        # The feed is XML, but the 'lxml' parser handles it fine here
        bs = BeautifulSoup(html, 'lxml')
        # Optional: dump the parsed feed to a file for inspection
        # with open('rss.xml', 'w+') as fd:
        #     fd.write(str(bs))
        # print(bs.title)

        # Only <a> tags whose text is '原文链接' point to the articles themselves
        for article in bs.findAll('a', text='原文链接'):
            if 'href' in article.attrs:
                print(article['href'])
    except Exception as e:
        print(e)
        return None


getAritcles('https://blog.csdn.net/adlatereturn/rss/list')
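
A side note on the parser choice mentioned above: the feed is XML, yet the HTML-oriented 'lxml' parser copes with it. A minimal sketch of the difference, assuming lxml is installed and using a made-up snippet as a stand-in for the real feed:

from bs4 import BeautifulSoup

snippet = '<rss><channel><item><a href="https://example.com/post/1">原文链接</a></item></channel></rss>'

# HTML parser backed by lxml: lenient, wraps the fragment in <html><body>
print(BeautifulSoup(snippet, 'lxml').find('a')['href'])

# Dedicated XML parser (also needs lxml): stricter and case-sensitive
print(BeautifulSoup(snippet, 'xml').find('a')['href'])

Both calls print the same href for this snippet; the point is only that either parser can read the feed.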

If instead we use the following:

        for i in bs.find_all('a'):
            print(i)

we can see that find_all locates the tags directly:

<a href="https://blog.csdn.net/adlatereturn/article/details/108889759">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/108732422">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/108502385">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/108356380">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/108046579">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107753159">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107335130">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107335130#comments" target="_blank">查看评论</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107286812">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107585630">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107586703">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107445014">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/107281562">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/106845518">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/106293203">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/106167280">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105897403">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105780480">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105691795">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105586921">原文链接</a>
<a href="https://blog.csdn.net/adlatereturn/article/details/105452737">原文链接</a>
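
For reference, a quick sketch (assuming the bs object from the script above; the variable name matches is my own) of what find_all actually returns:

        matches = bs.find_all('a')
        print(type(matches))     # <class 'bs4.element.ResultSet'>
        print(type(matches[0]))  # <class 'bs4.element.Tag'>
        print(len(matches))      # number of <a> tags found in the feed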

Using article['href'] gives us the actual URL.
Note that article.get_text() only returns the link text '原文链接' (a short sketch after the output below illustrates this).

        for article in bs.findAll('a'):
            if 'href' in article.attrs:
                print(article['href'])
https://blog.csdn.net/adlatereturn/article/details/108889759
https://blog.csdn.net/adlatereturn/article/details/108732422
https://blog.csdn.net/adlatereturn/article/details/108502385
https://blog.csdn.net/adlatereturn/article/details/108356380
https://blog.csdn.net/adlatereturn/article/details/108046579
https://blog.csdn.net/adlatereturn/article/details/107753159
https://blog.csdn.net/adlatereturn/article/details/107335130
https://blog.csdn.net/adlatereturn/article/details/107335130#comments
https://blog.csdn.net/adlatereturn/article/details/107286812
https://blog.csdn.net/adlatereturn/article/details/107585630
https://blog.csdn.net/adlatereturn/article/details/107586703
https://blog.csdn.net/adlatereturn/article/details/107445014
https://blog.csdn.net/adlatereturn/article/details/107281562
https://blog.csdn.net/adlatereturn/article/details/106845518
https://blog.csdn.net/adlatereturn/article/details/106293203
https://blog.csdn.net/adlatereturn/article/details/106167280
https://blog.csdn.net/adlatereturn/article/details/105897403
https://blog.csdn.net/adlatereturn/article/details/105780480
https://blog.csdn.net/adlatereturn/article/details/105691795
https://blog.csdn.net/adlatereturn/article/details/105586921
https://blog.csdn.net/adlatereturn/article/details/105452737
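
To make the get_text() point concrete, a quick sketch (again assuming the bs object from the script; the variable name first is my own):

        first = bs.find('a', text='原文链接')
        print(first.get_text())   # prints only the link text: 原文链接
        print(first['href'])      # prints the actual article URL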

There is an out-of-place '查看评论' (view comments) link in the middle.

So we filter on the link text:

        for article in bs.findAll('a',text='原文链接'):

which excludes the '查看评论' link and leaves only the correct URLs:

https://blog.csdn.net/adlatereturn/article/details/108889759
https://blog.csdn.net/adlatereturn/article/details/108732422
https://blog.csdn.net/adlatereturn/article/details/108502385
https://blog.csdn.net/adlatereturn/article/details/108356380
https://blog.csdn.net/adlatereturn/article/details/108046579
https://blog.csdn.net/adlatereturn/article/details/107753159
https://blog.csdn.net/adlatereturn/article/details/107335130
https://blog.csdn.net/adlatereturn/article/details/107286812
https://blog.csdn.net/adlatereturn/article/details/107585630
https://blog.csdn.net/adlatereturn/article/details/107586703
https://blog.csdn.net/adlatereturn/article/details/107445014
https://blog.csdn.net/adlatereturn/article/details/107281562
https://blog.csdn.net/adlatereturn/article/details/106845518
https://blog.csdn.net/adlatereturn/article/details/106293203
https://blog.csdn.net/adlatereturn/article/details/106167280
https://blog.csdn.net/adlatereturn/article/details/105897403
https://blog.csdn.net/adlatereturn/article/details/105780480
https://blog.csdn.net/adlatereturn/article/details/105691795
https://blog.csdn.net/adlatereturn/article/details/105586921
https://blog.csdn.net/adlatereturn/article/details/105452737

With that, we have a clear list of the targets we need to scrape.

All that remains is to crawl each URL individually; a rough sketch of that step follows.
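
As a rough sketch of that next step (the helper names getArticleUrls/getTitle and the User-Agent header are my own assumptions, not part of the original post), one could collect the URLs into a list and fetch each article's <title>:

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup


def getArticleUrls(rssUrl):
    # Same filtering as above: keep only links whose text is '原文链接'
    bs = BeautifulSoup(urlopen(rssUrl), 'lxml')
    return [a['href'] for a in bs.find_all('a', text='原文链接') if 'href' in a.attrs]


def getTitle(articleUrl):
    # Some sites reject the default urllib User-Agent, so send a browser-like one
    req = Request(articleUrl, headers={'User-Agent': 'Mozilla/5.0'})
    page = BeautifulSoup(urlopen(req), 'lxml')
    return page.title.get_text() if page.title else None


for url in getArticleUrls('https://blog.csdn.net/adlatereturn/rss/list'):
    print(url, getTitle(url))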

Dear CSDN, please let this one through review.


Reposted from blog.csdn.net/adlatereturn/article/details/109037551