Crawling basic blog post information with a native Python crawler

Today we use Python's built-in libraries to crawl some basic information from our own blog.

Preparation

Before writing a simple Python crawler, we need to decide what information we want to crawl and how to parse that information out of the pages we fetch.

Since this is a simple crawler, the information we want is not complicated.
We need five pieces of information for each blog post: the article URL, the article title, the upload date, the read count, and the comment count.
Because we are only using Python's standard library, we also parse the fetched HTML directly with regular expressions to extract the fields we need.
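As a quick illustration of the idea (the HTML snippet and pattern below are simplified placeholders, not the actual CSDN markup), a regular expression with capture groups can pull fields straight out of a chunk of HTML:

import re

# A simplified, made-up snippet standing in for one entry on the article list page
sample_html = '<a href="https://example.com/post/1" target="_blank">My first post</a>'

# Capture group 1 is the URL, capture group 2 is the title
pattern = r'<a href="(.*?)" target="_blank">(.*?)</a>'

match = re.search(pattern, sample_html)
if match:
    print(match.group(1))  # https://example.com/post/1
    print(match.group(2))  # My first post

The real patterns used below work the same way, just against CSDN's actual page markup.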

Project structure

The project consists of two modules: article.py (the information object) and spider.py (the crawler).

article.py module (information object)

To make it easier to package the crawled data, we create an information class in the article.py module that encapsulates the fields we want to crawl.

class Article(object):
    url = ''
    title = ''
    date = ''
    read_num = 0
    comment_num = 0

    def __init__(self, url, title, date, read_num, comment_num):
        self.url = url
        self.title = title
        self.date = date
        self.read_num = read_num
        self.comment_num = comment_num
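As a minimal usage sketch (the values below are purely illustrative), an Article instance simply carries the five fields and can be inspected through its __dict__:

from article import Article

article = Article('https://blog.csdn.net/qq_45193304/article/details/105454839',
                  'Sample title', '2020-04-11 12:00:00', 100, 2)
print(article.__dict__)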

spider.py module (the crawler)

We create a crawler class in the spider.py module.

import re
from urllib import request

from article import Article


class Spider:
    # URL of the blog article list
    url = 'https://blog.csdn.net/qq_45193304/article/list/'
    # Regex for the content block of a single blog post
    article_pattern = r'<div class="article-item-box csdn-tracking-statistics" data-articleid="[\d]*?">[\s\S]*?</div>[\S\s]*?</div>'
    # Regex for the URL and title of a single post (原创 is CSDN's "original" label in the page markup)
    title_pattern = r'<a href="([\s\S]*?)" target="_blank">[\S\s]*?<span class="article-type type-1 float-none">原创</span>([\s\S]*?)</a>'
    # Regex for the upload date of a single post
    date_pattern = r'<span class="date">([\d\D]*?)</span>'
    # Regex for the read count of a single post (阅读数 means "reads" in the page markup)
    read_num_pattern = r'<span class="read-num">阅读数 <span class="num">([\d]*?)</span> </span>'
    # Regex for the comment count of a single post (评论数 means "comments" in the page markup)
    comment_num_pattern = r'<span class="read-num">评论数 <span class="num">([\d]*?)</span> </span>'

    # Fetch the HTML page at the given URL and decode it as UTF-8
    def __fetch_content(self, url):
        r = request.urlopen(url)
        html = r.read()
        return str(html, encoding='utf-8')

    # Extract the article content blocks from the HTML page
    def __get_article(self, html):
        article_html = re.findall(self.article_pattern, html)
        return article_html

    # Process one article block and extract the post URL and title
    def __get_url_and_title(self, article):
        r = re.search(self.title_pattern, article)
        url = r.group(1).strip()
        title = r.group(2).strip()
        return url, title

    # Process one article block and extract the upload date
    def __get_date(self, article):
        r = re.search(self.date_pattern, article)
        date = r.group(1).strip()
        return date

    # Process one article block and extract the read count
    def __get_read_num(self, article):
        r = re.search(self.read_num_pattern, article)
        read_num = int(r.group(1))
        return read_num

    # Process one article block and extract the comment count
    def __get_comment_num(self, article):
        r = re.search(self.comment_num_pattern, article)
        comment_num = int(r.group(1))
        return comment_num

    # Iterate over the article blocks, extract each post's fields and package them into Article objects
    def __do_package(self, article_html, article_list):
        for article in article_html:
            url, title = self.__get_url_and_title(article)
            date = self.__get_date(article)
            read_num = self.__get_read_num(article)
            comment_num = self.__get_comment_num(article)
            article = Article(url, title, date, read_num, comment_num)
            article_list.append(article)

    # Entry point of the crawler
    def go(self):
        page = 1
        article_list = []
        while True:
            url = self.url + str(page)
            page += 1
            html = self.__fetch_content(url)
            article_html = self.__get_article(html)
            if len(article_html) == 0:
                break
            self.__do_package(article_html, article_list)
        return article_list

# Instantiate a Spider object
spider = Spider()
# Call spider.go() to get a list of Article objects
article_list = spider.go()

# How many records were crawled in total
print('The crawler fetched ' + str(len(article_list)) + ' blog post records')

# Iterate over the list and print each record
for article in article_list:
    print(article.__dict__)

End

With the code above, the information we crawled using only Python's standard library is printed directly to the console.
If we want to crawl another blog user's posts, we only need to change the value of url, as sketched below.
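For example, as a minimal sketch (the user id below is a placeholder, not a real account), the instance's url attribute can be pointed at another user's article list page before calling go(), for instance inside spider.py in place of the run code above:

spider = Spider()
# Point the crawler at another user's article list page (replace the placeholder id)
spider.url = 'https://blog.csdn.net/another_user_id/article/list/'
article_list = spider.go()
print(len(article_list))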

One last note: CSDN does not block crawling of blog content itself, but it does have an anti-crawling mechanism for metrics such as view counts, which prevents users from maliciously inflating their traffic with a crawler. So anyone hoping to boost their view count this way should give up on the idea.

Origin blog.csdn.net/qq_45193304/article/details/105454839