Use Python to download all of a blogger's posts and save them in pdf format

Article Directory

1. Analyze the coding approach

2. Code steps

1. Import the required libraries

2. Analyze the blogger's homepage

3. Extract the required data

4. Traverse the URL of each of the blogger's articles


5. Construct the html web page

6. Create a folder

7. Save the html file

8. Convert the html file to a pdf file

3. Full code and results

Summary

1. Analyze the coding approach

  1. Author url + headers
  2. Check whether the author's URL is a static webpage
  3. Parse the webpage to get the URL of each of the author's posts, plus the author's name
  4. Visit each post's URL and parse the data
  5. Extract the html text string and the title
  6. Create a folder
  7. Save the html text
  8. Convert to pdf

2. Code steps
1. Import the required libraries
The code is as follows (example):

import requests,parsel,os,pdfkit
from lxml import etree
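
If any of these third-party libraries are missing, install requests, parsel, lxml, and pdfkit with pip before running the script; pdfkit also relies on the external wkhtmltopdf program, which step 8 below covers.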


2. Analyze the blogger's homepage
2.1 Open any blogger's homepage, for example that of the blogger "w wants to become stronger"

2.2 Open Developer Tools and refresh the page to capture the request for the blogger's homepage URL


2.3 Right-click to view the page source: the blogger's homepage turns out to be a static webpage. I chose XPath to parse it here; you could equally use CSS selectors, BeautifulSoup, or another parser. A quick way to confirm the page is static is sketched below.
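
As a quick sanity check (a minimal sketch, with a placeholder URL), you can confirm the page is server-rendered by looking for post links directly in the raw HTML that requests returns; CSDN post URLs contain the /article/details/ segment:

    import requests

    # Placeholder blogger homepage URL; substitute a real one
    url = 'https://blog.csdn.net/some_blogger'
    headers = {'User-Agent': 'Mozilla/5.0'}
    raw_html = requests.get(url, headers=headers).text

    # If post links already appear in the raw response, the page is
    # static and can be parsed without rendering JavaScript
    print('/article/details/' in raw_html)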

The code is as follows:

    # 1. author_url + headers
    author_url = input("Enter the CSDN blogger's URL: ")
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/87.0.4280.88 Safari/537.36'}
    response = requests.get(author_url, headers=headers).text
    # 2. The author's homepage is static; parse each post URL with XPath
    html_xpath = etree.HTML(response)



3. Extract the required data
3.1 Extract the blogger's name and the URLs of all posts


The code is as follows:


    try:
        author_name = html_xpath.xpath(r'//*[@class="user-profile-head-name"]/div/text()')[0]
        # print(author_name)
        author_book_urls = html_xpath.xpath(r'//*[@class="blog-list-box"]/a/@href')
        # pprint(author_book_urls)
    except Exception:
        # Fallback for the other homepage layout
        author_name = html_xpath.xpath(r'//*[@id="uid"]/span/text()')[0]
        author_book_urls = html_xpath.xpath(r'//*[@class="article-list"]/div/h4/a/@href')


The XPath of an element can be copied straight from Developer Tools. A word about the exception handling used here: a few bloggers' homepages use a different layout. The analysis is the same; only the elements' XPath paths differ, so whichever form the homepage takes, the post URLs and the blogger's name can still be extracted, with the exception handler switching between the two layouts. A more general alternative is sketched below.
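
The same fallback can also be written without exceptions by trying candidate pairs of XPath expressions in turn; a sketch, assuming these are the only two layouts (html_xpath is the parsed tree from above):

    # Candidate (name, post-urls) XPath pairs for the known homepage layouts
    candidates = [
        (r'//*[@class="user-profile-head-name"]/div/text()',
         r'//*[@class="blog-list-box"]/a/@href'),
        (r'//*[@id="uid"]/span/text()',
         r'//*[@class="article-list"]/div/h4/a/@href'),
    ]
    author_name, author_book_urls = None, []
    for name_xp, urls_xp in candidates:
        names, urls = html_xpath.xpath(name_xp), html_xpath.xpath(urls_xp)
        if names and urls:  # this layout matched
            author_name, author_book_urls = names[0], urls
            break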

4. Traverse the URL of each of the blogger's articles
4.1 Each post's page is also a static webpage: send a request, get the response, and parse it

The code is as follows:

    for author_book_url in author_book_urls:
        book_res = requests.get(author_book_url, headers=headers).text
        # 4. Parse the response with XPath and with a CSS selector
        html_book_xpath = etree.HTML(book_res)
        html_book_css = parsel.Selector(book_res)


4.2 A CSS selector extracts the article's html content, and XPath extracts the article's title


The code is as follows:

        book_title = html_book_xpath.xpath(r'//*[@id="articleContentId"]/text()')[0]
        html_book_content = html_book_css.css('#mainBox > main > div.blog-content-box').get()

5. Construct the html web page


   

        # 5. Build the page skeleton and insert the article's html content
        html = \
            '''
            <!DOCTYPE html>
            <html lang="en">
            <head>
                <meta charset="UTF-8">
                <title>Title</title>
            </head>
            <body>
                {}
            </body>
            </html>
            '''.format(html_book_content)

6. Create a folder

        # 6. Create a folder for the blogger
        if not os.path.exists(r'./{}'.format(author_name)):
            os.mkdir(r'./{}'.format(author_name))
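
As an aside, os.makedirs with exist_ok=True folds the existence check into a single call and also creates nested directories; a drop-in alternative:

        # Equivalent to the check-then-mkdir above, and safe to call repeatedly
        os.makedirs(r'./{}'.format(author_name), exist_ok=True)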
7. Save the html file

        # 7. Save the html text
        try:
            with open(r'./{}/{}.html'.format(author_name, book_title), 'w', encoding='utf-8') as f:
                f.write(html)
            print('*** {}.html downloaded successfully ***'.format(book_title))
        except Exception:
            # If writing fails (e.g. the title contains characters that are
            # invalid in a filename), skip this post
            continue
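
Rather than skipping such posts, the title could be sanitized first; a sketch, where the underscore replacement is an arbitrary choice:

    import re

    def safe_filename(title):
        # Replace characters that Windows and most filesystems forbid in names
        return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

    # Usage inside the loop: book_title = safe_filename(book_title)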


8. Convert the html file to a pdf file
Prerequisite for the conversion: the wkhtmltopdf.exe driver file must be downloaded first!

        # 8. Convert to pdf with the pdfkit package
        try:
            config = pdfkit.configuration(
                wkhtmltopdf=r'D:\programs\wkhtmltopdf\bin\wkhtmltopdf.exe'
            )
            pdfkit.from_file(
                r'./{}/{}.html'.format(author_name, book_title),
                './{}/{}.pdf'.format(author_name, book_title),
                configuration=config
            )
            print('****** {}.pdf saved successfully ******'.format(book_title))
        except Exception:
            # If the conversion fails, move on to the next post
            continue
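
Most pdfkit failures come from a wrong wkhtmltopdf path, so a quick check before the loop starts can save time (a sketch, using the path assumed above):

    import os

    wkhtmltopdf_path = r'D:\programs\wkhtmltopdf\bin\wkhtmltopdf.exe'
    if not os.path.exists(wkhtmltopdf_path):
        raise FileNotFoundError('wkhtmltopdf.exe not found; install it or fix the path')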



3. Full code and results

# !/usr/bin/env python
# -*- coding: utf-8 -*-
 
'''
    Goal: crawl all blog posts of a given blogger
    1. Author url + headers
    2. Check whether the author's URL is a static webpage
    3. Parse the webpage to get each post's URL and the author's name
    4. Visit each post's URL and parse the data
    5. Extract the html text and the title
    6. Create the folder hierarchy
    7. Save the html text
    8. Convert to pdf
'''
import requests,parsel,os,pdfkit
from lxml import etree
from pprint import pprint
def main():
    # 1. author_url + headers
    author_url = input("Enter the CSDN blogger's URL: ")
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/87.0.4280.88 Safari/537.36'}
    response = requests.get(author_url, headers=headers).text
    # 2. The author's homepage is static; parse each post URL with XPath
    html_xpath = etree.HTML(response)
 
    try:
        author_name = html_xpath.xpath(r'//*[@class="user-profile-head-name"]/div/text()')[0]
        # print(author_name)
        author_book_urls = html_xpath.xpath(r'//*[@class="blog-list-box"]/a/@href')
        # print(author_book_urls)
    except Exception:
        # Fallback for the other homepage layout
        author_name = html_xpath.xpath(r'//*[@id="uid"]/span/text()')[0]
        author_book_urls = html_xpath.xpath(r'//*[@class="article-list"]/div/h4/a/@href')
 
    # print(author_name,author_book_urls,sep='\n')
 
    # 3. Loop over each post's URL and request the page
    for author_book_url in author_book_urls:
        book_res = requests.get(author_book_url,headers = headers).text
        # 4. Parse the response with XPath and with a CSS selector
        html_book_xpath = etree.HTML(book_res)
        html_book_css = parsel.Selector(book_res)
        book_title = html_book_xpath.xpath(r'//*[@id="articleContentId"]/text()')[0]
        html_book_content = html_book_css.css('#mainBox > main > div.blog-content-box').get()
 
        # 5. Build the page skeleton and insert the article's html content
        html = \
            '''
            <!DOCTYPE html>
            <html lang="en">
            <head>
                <meta charset="UTF-8">
                <title>Title</title>
            </head>
            <body>
                {}
            </body>
            </html>
            '''.format(html_book_content)

        # 6. Create a folder for the blogger
        if not os.path.exists(r'./{}'.format(author_name)):
            os.mkdir(r'./{}'.format(author_name))

        # 7. Save the html text
        try:
            with open(r'./{}/{}.html'.format(author_name, book_title), 'w', encoding='utf-8') as f:
                f.write(html)
            print('*** {}.html downloaded successfully ***'.format(book_title))
        except Exception:
            continue

        # 8. Convert to pdf with the pdfkit package
        try:
            config = pdfkit.configuration(
                wkhtmltopdf=r'D:\programs\wkhtmltopdf\bin\wkhtmltopdf.exe'
            )
            pdfkit.from_file(
                r'./{}/{}.html'.format(author_name, book_title),
                './{}/{}.pdf'.format(author_name, book_title),
                configuration=config
            )
            print('****** {}.pdf saved successfully ******'.format(book_title))
        except Exception:
            continue


if __name__ == '__main__':
    main()
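
To use it: run the script, paste a blogger's homepage URL at the prompt, and an html file plus a pdf copy of every post are saved into a folder named after the blogger.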

Origin: blog.csdn.net/Python_kele/article/details/115038240