How to Crawl the Titles and Links of All the Articles in a CSDN Blog Column

Today I was writing a navigation post for my blog, trying to put together a summary of the articles in each of my columns. The first few columns were fine, since they did not contain many articles, but when I reached the column collecting algorithm problems I instantly saw the issue: I have not been writing for long, yet there are already about 100 algorithm-problem posts, and copying them all by hand would take forever. That is when it occurred to me: why not crawl the titles and links of my own column articles?
To be safe, I first took a look at CSDN's robots.txt, the site's crawling policy; after all, I do not want to dig a hole for myself and then sit in it.
CSDN's robots.txt is shown below:
(screenshot of CSDN's robots.txt)
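As a side note, the robots.txt check can also be done in code. Here is a minimal sketch using Python's built-in urllib.robotparser module; the column URL is the one crawled later in this post, and the answer naturally depends on whatever CSDN's live robots.txt allows at the time you run it.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://blog.csdn.net/robots.txt')
rp.read()

# True if a generic ('*') crawler may fetch the column page under the current rules
print(rp.can_fetch('*', 'https://blog.csdn.net/qq_43422111/category_8847039.html'))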
The takeaway is that crawling only the titles and URLs of my own blog articles is not a problem. So, without further ado, here is the code: a short script, with comments explaining what each part does.

import re
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
    # Host header for the target site
    'Host': 'blog.csdn.net'
}

# URL of the column (category) page
link = 'https://blog.csdn.net/qq_43422111/category_8847039.html'

# fetch the page
r = requests.get(link, headers=headers, timeout=10)

# parse the HTML with BeautifulSoup (the "lxml" parser must be installed)
soup = BeautifulSoup(r.text, "lxml")

# each article entry on the column page appears as an <li> element
li_list = soup.find_all('li')

articles = []
for each in li_list:
    article = {}
    link_tag = each.find('a')
    title_tag = each.find('h2', class_="title")
    # skip <li> elements that are not article entries
    if link_tag is None or title_tag is None:
        continue
    href = str(link_tag.get('href')).strip()
    title = str(title_tag)
    # the title needs further processing: under this <h2> CSDN also inserts two
    # HTML comments, <!--####试读--><!--####试读-->, so strip them with a regex
    re_comment = re.compile('<!--[^>]*-->')
    title = re_comment.sub("", title)
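    # For example (hypothetical markup), the substitution turns
    #   '<h2 class="title"> Some Title<!--####试读--><!--####试读-->'
    # into
    #   '<h2 class="title"> Some Title'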
    # after the comments are removed, use another regex to extract the content
    # between <h2 ...> and </h2>
    # on this page the closing </h2> sits on the next line, with a \n in between,
    # so the end of the match is set to \n; adjust this to your own page layout
    # this gives us the final title (the link was already extracted above)
    title_re = re.findall(r'<h2 class="title">(.*?)\n', title)
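    # Illustration (hypothetical markup): given
    #   '<h2 class="title"> Some Title    \n</h2>'
    # the regex captures ' Some Title    ' (everything up to the newline)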
    # the captured title still carries surrounding spaces, so remove them:
    # join the one-element list into a string, then strip() trims both ends
    mid = "".join(title_re)
    title_f = mid.strip()
    article['href'] = href
    article['title'] = title_f
    articles.append(article)

# finally, save the processed `articles` list to a text file,
# formatted as Markdown inline links

# path of the output file (the with-block closes the file when done)
with open('/Users/qiguan/article.txt', mode='a', encoding='utf-8') as mylog:
    for article in articles:
        # echo each entry to the console as well
        print("[", article['title'], "]", "(", article['href'], ")", sep='', end='\n')
        # and write the same line to the file
        print("[", article['title'], "]", "(", article['href'], ")", sep='', end='\n', file=mylog)



That is it: the script prints the data to the console and saves it to a file. It only crawls one column at a time, but for putting together a blog navigation post that is more than enough.
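For reference, each output line uses Markdown's inline-link syntax, so an entry looks like this (hypothetical title and article ID):

[Example Article Title](https://blog.csdn.net/qq_43422111/article/details/XXXXXXXX)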
