[Office Automation] Convert Markdown files to plain text files in batches

This article explains how to convert Markdown files to plain text files in batches. Markdown is a lightweight markup language for writing simply formatted documents. Sometimes, however, we need to convert Markdown files to plain text for further processing, or to HTML for viewing directly in the browser. Here is a simple way to do both.

Convert to HTML

To convert Markdown files to HTML files, you can use Python's markdown library. First make sure the markdown library is installed. If it is not, you can install it with the following command:

pip install markdown
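
To check whether the library is already present before installing, a small sketch along these lines may help. The helper name is_installed is hypothetical and not part of the original article; it relies only on the standard library's importlib.util.find_spec, which looks a module up without importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # 'markdown' is the package installed by the pip command above
    if is_installed('markdown'):
        print('markdown is available')
    else:
        print('markdown is missing; run: pip install markdown')
```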

The Markdown file can then be converted to an HTML file using the following code:

import markdown

def md_to_txt(md_file, txt_file):
    with open(md_file, 'r', encoding='utf-8') as f:
        md_content = f.read()
        txt_content = markdown.markdown(md_content)
    
    with open(txt_file, 'w', encoding='utf-8') as f:
        f.write(txt_content)

md_file = 'example.md'  # path to the Markdown file
txt_file = 'example.html'  # path to the converted HTML file
md_to_txt(md_file, txt_file)

Replace example.md with the path to the Markdown file you want to convert, and example.html with the path where the HTML file should be saved.
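
The HTML produced this way still contains tags. If plain text is the goal, one possible follow-up step is to strip the tags with the standard library's html.parser. This is only a sketch: the names TextExtractor and html_to_text are assumptions made for illustration and do not appear in the original article.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for every run of text between tags
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


if __name__ == '__main__':
    sample = '<h1>Title</h1>\n<p>Some <em>plain</em> text.</p>'
    print(html_to_text(sample))
```

The string passed to html_to_text() could be the return value of markdown.markdown(), so the two steps can be chained without writing the intermediate HTML file.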

Convert to txt

If you want to remove links and keep only the plain text, we define a function md_to_txt() that accepts two parameters: md_file, the path of the Markdown file, and txt_file, the directory where the converted plain text file is stored. The function first reads the contents of the Markdown file with open() and splits it line by line into a list of strings, str_list. It then iterates through each line in the list, skipping lines that contain specific keywords (such as ![ or https) and removing specific filler text (such as 如下图所示:, "as shown in the figure below:"). The processed text is appended to the txt_content variable, and the title and category are updated from any title: and category: lines. Finally, txt_content is written to a plain text file in the target directory, and a completion message is printed.

import os
import re

def traverse_dir_files(root_dir, ext=None, is_sorted=True):
    """
    List the files in a folder, walking subdirectories recursively.
    :param root_dir: root directory
    :param ext: file extension, or a list/tuple of extensions
    :param is_sorted: whether to sort the result
    :return: list of file paths
    """
    if isinstance(ext, str):  # allow a single extension such as '.md'
        ext = (ext,)
    paths_list = []
    for parent, _, file_names in os.walk(root_dir):
        for name in file_names:
            if name.startswith('.'):  # skip hidden files
                continue
            # keep the file if no extension filter is set or the suffix matches
            if not ext or name.endswith(tuple(ext)):
                paths_list.append(os.path.join(parent, name))
    if is_sorted:
        paths_list.sort()
    return paths_list


def remove_code_blocks(text):
    # optional helper: remove fenced code blocks from the text
    return re.sub(r'```(.*?)```', '', text, flags=re.DOTALL)

def md_to_txt(md_file, txt_file):
    # txt_file is the output directory; the file name is derived from the title
    txt_content = ''
    title = os.path.basename(md_file).replace('.md', '').strip()
    category = ''
    with open(md_file, 'r', encoding='utf-8') as f:
        str_list = f.read().splitlines()
    for md in str_list:
        if '![' in md or 'https' in md:  # skip image and link lines
            continue
        md = md.replace('如下图所示:', '')  # drop the "as shown in the figure below:" filler
        txt_content += md + '\n'
        if 'title:' in md:
            title = md.replace('title:', '').strip()
        if 'category:' in md:
            category = md.replace('category:', '').strip()
    if category:  # prefix the category only when one was found
        title = category + '_' + title
    os.makedirs(txt_file, exist_ok=True)  # create the output directory if it does not exist
    with open(os.path.join(txt_file, title + '.txt'), 'w', encoding='utf-8') as f:
        f.write(txt_content)
    print("Conversion complete: %s" % md_file)
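
Skipping every line that contains https also discards any ordinary text on that line. As a possible alternative, the sketch below removes inline images but keeps the visible text of links. The function name strip_markup and both regular expressions are assumptions for illustration; they handle only the simple inline forms ![alt](url) and [text](url), not reference-style links or nested brackets.

```python
import re

# ![alt](url): inline image, removed entirely
IMAGE_RE = re.compile(r'!\[[^\]]*\]\([^)]*\)')
# [text](url): inline link, replaced by its visible text
LINK_RE = re.compile(r'\[([^\]]+)\]\([^)]*\)')


def strip_markup(text):
    """Remove inline images and reduce inline links to their link text."""
    text = IMAGE_RE.sub('', text)      # images must go first, they also contain [..](..)
    text = LINK_RE.sub(r'\1', text)    # keep the link text, drop the URL
    return text


if __name__ == '__main__':
    sample = 'See the [docs](https://example.com/docs) first.\n![chart](img/chart.png)\nDone.'
    print(strip_markup(sample))
```

A function like this could be applied to the whole file content before it is split into lines, in place of the keyword check.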

Traverse the specified directory

A function readlist() is defined to traverse all Markdown files in the specified directory and call md_to_txt() on each of them. It accepts two parameters: path, the directory to traverse, and txt_dir, the directory where the converted plain text files are stored. The function uses traverse_dir_files(), defined above, to collect all file paths with the .md extension in the directory and stores them in the list path_list. It then iterates through each file path in the list and attempts to convert it with md_to_txt(). If an exception occurs during conversion, an error message is printed and processing continues with the next file.

def readlist(path, txt_dir):
    # pass the extension as a list so it is matched as a whole suffix
    path_list = traverse_dir_files(root_dir=path, ext=['.md'])
    for path_str in path_list:
        try:
            md_to_txt(path_str, txt_dir)
        except Exception as e:
            print(path_str + '---------error-----------')
            print(e)
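
For comparison, the same directory traversal can be sketched with the standard library's pathlib instead of os.walk. The function name find_markdown_files is hypothetical and not part of the original article; it only illustrates the idea of collecting .md files recursively while skipping hidden files.

```python
from pathlib import Path


def find_markdown_files(root_dir):
    """Return all .md files under root_dir, sorted, skipping hidden files."""
    return sorted(
        p for p in Path(root_dir).rglob('*.md')
        if p.is_file() and not p.name.startswith('.')
    )


if __name__ == '__main__':
    # 'data' is the example directory used in this article
    for md_path in find_markdown_files('data'):
        print(md_path)
```

Each returned Path can be converted with str() before being passed to a function that expects a string path.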

Finally, we can call these functions in a Python script to convert Markdown to plain text. For example, suppose we have a Markdown file data/tree.md and we want to convert it to a plain text file saved in the data/txt directory. We can write the code like this:

if __name__ == '__main__':
    md_file = 'data'  # directory containing the Markdown files
    txt_dir = 'data/txt'  # directory where the converted plain text files are stored
    readlist(md_file, txt_dir)

After running this code, the data/txt directory contains a plain text file generated from tree.md. Its content matches the original Markdown file, except for the lines that were filtered out.

Origin blog.csdn.net/luansj/article/details/132029707