How to repost articles quickly: one script is enough

Introduction

On the Internet, we often come across valuable articles that we want to keep. We can export them as text or PDF files, but what if we want a .md file (the format used by Markdown readers and editors) instead? This article shows how to use a Python script to export an online article as a .md file.

Implementation

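The complete Python script is shown below. It downloads the article page, extracts the title and body, converts the body to Markdown, and saves the result as a .md file. The CSS selectors are the ones the script uses for a CSDN article page; other sites will need different selectors.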

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import html2text
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Referer': 'http://www.example.com/'
}

def export_article(url):
    # Send a GET request to fetch the article page; fail early on HTTP errors
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    html_content = response.text

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract the article title and body
    title = soup.select_one('.title-article').text.strip()
    article_content = soup.select_one('.article_content').prettify()

    # Convert the article body to Markdown
    md_content = html2text.html2text(article_content)

    # Clean up image links (some image URLs get wrapped onto a new line automatically)
    pattern = r'(https?://[^\s\r\n]+)[\r\n\s]+'
    replacement = r'\1'
    for _ in range(2):
        md_content = re.sub(pattern, replacement, md_content)

    # Save the Markdown content to a file named after the article title
    with open(f"{title}.md", "w", encoding="utf-8") as f:
        f.write(md_content)

    print(f"Article successfully exported as {title}.md")


export_article("https://blog.csdn.net/xxxxxxxxxx")
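
One caveat with the script above: it uses the article title directly as the file name, and titles often contain characters such as `/`, `:` or `?` that some file systems reject. A minimal sketch of a sanitizing step follows. The helper name `safe_filename` and its `fallback` parameter are hypothetical and not part of the original script.

```python
import re

def safe_filename(title, fallback="article"):
    """Return a version of `title` that is safer to use as a file name.

    Hypothetical helper, not part of the original script.
    """
    # Replace runs of characters that Windows or POSIX file systems reject
    # (plus line breaks and tabs) with a single underscore.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title)
    # Drop leading/trailing spaces, dots and underscores.
    cleaned = cleaned.strip(" ._")
    # Fall back to a default name if nothing usable is left.
    return cleaned or fallback

print(safe_filename("Python/requests: a quick tour"))
```

Inside `export_article`, the file could then be opened with `f"{safe_filename(title)}.md"` instead of `f"{title}.md"`.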

With this script, users can easily save online articles as .md files for offline reading. They can also adjust the format of the resulting .md files to suit their own reading preferences.
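
As one example of such an adjustment, the exported file could begin with the article title as a heading, followed by a link back to the original page (the script as written saves only the article body). This is a sketch of one possible customization, not something the original script does. `add_header` and its parameters are hypothetical names.

```python
def add_header(md_content, title, source_url):
    """Prepend a title heading and a source line to the Markdown text.

    Hypothetical post-processing step, not part of the original script.
    """
    header = f"# {title}\n\n> Source: {source_url}\n\n"
    # Strip leading blank lines so the body follows the header directly.
    return header + md_content.lstrip("\n")

sample = add_header("\n\nSome body text.\n", "My Article", "https://example.com/post")
print(sample)
```

In `export_article`, this could be applied to `md_content` just before the file is written, passing in the `title` and `url` values the function already has.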

Please note that this script is for learning and reference only. It must not be used for illegal purposes, or for commercial purposes without authorization. When using it, make sure to comply with each site's terms of use and copyright rules, and respect the rights of the original author.

Origin blog.csdn.net/weixin_40301728/article/details/131969291