Scraping a Specific cnblogs (博客园) Blogger's Articles with Python

Copyright notice: learning technology is hard work, so please credit the source when reposting and be someone who enjoys sharing :). https://blog.csdn.net/jishuzhain/article/details/83213158

Overview

This script scrapes every article published by a specific cnblogs (博客园) blogger. The code borrows from many people's snippets, pieced together until it finally worked; many thanks to all the people who share so generously online.
While searching I came across the article 爬虫HelloWorld:爬取博客园某博主所有文章 (Crawler HelloWorld: scrape all posts of a cnblogs blogger), but its author never actually implemented it, so I had to roll my own.

Environment

  • Python 2.7; tested successfully on Windows 10
  • html2text converts each post's HTML into a Markdown document for easier downstream processing
  • PDF generation runs in multiple threads to speed it up
  • wkhtmltopdf (driven through pdfkit) does the HTML-to-PDF conversion; a setup sketch follows this list
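
pdfkit is only a thin wrapper around the wkhtmltopdf binary, which must be installed separately and, on Windows, is usually not on the PATH. A minimal setup sketch; the install path and the encoding option are assumptions, adjust them to your machine:

# Sketch: point pdfkit at the wkhtmltopdf binary explicitly (Windows).
# The install path below is an assumption; adjust it to your machine.
import pdfkit

config = pdfkit.configuration(
    wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
options = {
    'encoding': 'UTF-8',  # avoids garbled Chinese text in the PDF
}
pdfkit.from_file('0.html', '0.pdf', configuration=config, options=options)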

Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2018/10/12
# @Author  : jishuzhain
# @Link    :
# @Version : 1.0
# Images need absolute URLs, or download them to local files first
# PDF files are generated with multiple threads

from __future__ import with_statement
import threading
import requests
from bs4 import BeautifulSoup
import html2text
import re
import time
import os
import pdfkit
# from PyPDF2 import PdfFileMerger
import sys
reload(sys)
sys.setdefaultencoding('utf8')  # Python 2 hack to avoid UnicodeDecodeError when writing files

html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
<h2>{title}</h2>
{content}
</body>
</html>

"""

path = "D:\\html2md\\skyblue-li"


def get_html():
    p = 0
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    requests.packages.urllib3.disable_warnings()  # silence the InsecureRequestWarning caused by verify=False

    # Walk the list pages from last to first; the page count (10) is hardcoded
    # for this blog and needs adjusting for other bloggers
    for n in range(10, 0, -1):
        url = 'https://www.cnblogs.com/skyblue-li/default.html?page={}'.format(n)
        # url = 'https://www.cnblogs.com/skyblue-li/p/5900100.html'
        time.sleep(1)
        r = requests.get(url, headers=headers, verify=False)
        soup = BeautifulSoup(r.text, "html.parser")
        blog_list = soup.find_all(class_='postTitle')
        for i in blog_list:
            time.sleep(2)
            href = i.find('a').get('href')
            href_response = requests.get(href, headers=headers, verify=False)
            href_soup = BeautifulSoup(href_response.text, "html.parser")
            title = href_soup.find(class_='postTitle').text
            # if "JavaWeb" in title:
            print title
            body = href_soup.find(id='cnblogs_post_body')
            # Protocol-relative image URLs (src="//...") would be lost in the
            # offline PDF, so rewrite them as absolute https URLs
            body = re.sub(r'<img.*?src="//', "<img src=\"https://", str(body))
            # content = str(body)
            title = str(title)
            html = html_template.format(content=body, title=title)
            # Create the output directory on first use, then write the HTML file
            if not os.path.exists(path):
                os.mkdir(path)
            with open(path + '\\' + str(p) + ".html", 'w') as f:
                f.write(html)

            with open(path + '\\' + str(p) + ".html", 'rb') as f:
                html_content = f.read()
            md_content = html2text.html2text(html_content.decode('utf-8', 'ignore').replace(u'\xa9', u''))

            with open(path + '\\' + str(p) + '.md', 'w') as f:
                f.write(md_content)
            p += 1
    return p


def save_pdf(htmls, file_name):
    """
    Convert a single HTML file to a PDF file.
    :param htmls: path of the HTML file
    :param file_name: output PDF file name
    :return:
    """
    pdfkit.from_file(htmls, file_name)


def main():
    count = get_html()
    # One thread per article; fine for a few dozen posts, but consider
    # a thread pool if the blog has many articles
    threads = []
    for i in range(0, count):
        t = threading.Thread(target=save_pdf,
                             args=(path + '\\' + str(i) + '.html', path + '\\' + str(i) + '.pdf'))
        threads.append(t)

    for t in threads:
        t.start()

    for t in threads:
        t.join()


if __name__ == "__main__":
    main()
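
The commented-out PdfFileMerger import hints at a final step the script does not perform: combining the per-article PDFs into one document. A minimal sketch with PyPDF2, reusing the script's path variable; the output name merged.pdf is my own choice:

# Sketch: merge the numbered per-article PDFs into a single file with PyPDF2.
# Assumes 0.pdf .. (count-1).pdf were generated under `path` by main().
from PyPDF2 import PdfFileMerger
import os

def merge_pdfs(count, out_name='merged.pdf'):
    merger = PdfFileMerger()
    for i in range(count):
        pdf = path + '\\' + str(i) + '.pdf'
        if os.path.exists(pdf):  # skip articles whose conversion failed
            merger.append(pdf)
    merger.write(path + '\\' + out_name)
    merger.close()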

Links

GitHub repository
