Writing a PDF format conversion program in Python


Foreword

When working with documents, I often want to convert a PDF to an image or a document format, but most tools either charge for it or come bundled with unwanted software, so it is better to write one myself. Here pdf2image and pdf2docx are used for the conversion.

Two key functions

Converting to images

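import tkinter as tk
import tkinter.messagebox
from pdf2image import convert_from_path

def file2Pic():
    # pdf_name, file_format, var and counter_l are set up by the GUI code;
    # mkdir is a helper defined elsewhere in the program.
    global i, pdf_name, file_format
    if pdf_name == '':
        tk.messagebox.showwarning(message='Please select the file to export')
    else:
        if var == 0:
            tk.messagebox.showwarning(message='Please select the export file type')
        else:
            i = 0
            # Output folder named after the PDF, without the '.pdf' suffix
            total_file = pdf_name[:-4]
            mkdir(total_file)
            # Render every page at 500 dpi as a PIL image
            pages = convert_from_path(pdf_name, 500)
            for page in pages:
                file_name = total_file + '/' + str(i) + file_format
                # No explicit format: Pillow infers it from the file suffix
                page.save(file_name)
                i += 1
            counter_l.config(text='Conversion complete')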

Here file_format is the export suffix. To keep things simple, no save format is chosen explicitly; the file name is built with the selected suffix and the image is saved under that name, so the output can be jpg, bmp or png.
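The naming scheme described above can be sketched as a small standalone helper. This is only a minimal sketch: build_page_paths and pdf_to_images are hypothetical names that are not part of the original program, the 200 dpi default is an arbitrary choice, and pdf2image needs poppler installed to actually render pages.

```python
import os


def build_page_paths(pdf_name, page_count, file_format):
    """Return one output path per page, e.g. report/0.png, report/1.png."""
    out_dir = os.path.splitext(pdf_name)[0]  # folder named after the PDF
    return [os.path.join(out_dir, f"{i}{file_format}") for i in range(page_count)]


def pdf_to_images(pdf_name, file_format=".png", dpi=200):
    """Render each page and save it under a name ending in file_format."""
    # Imported here so the path helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_name, dpi=dpi)
    paths = build_page_paths(pdf_name, len(pages), file_format)
    os.makedirs(os.path.splitext(pdf_name)[0], exist_ok=True)
    for page, path in zip(pages, paths):
        # No format argument: Pillow picks JPEG/PNG/BMP from the suffix.
        page.save(path)
    return paths
```

Calling pdf_to_images('report.pdf', '.jpg') would, under these assumptions, write report/0.jpg, report/1.jpg and so on.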

Converting to docx

This part uses the open source project pdf2docx; see the project's documentation for how to use it.
The project supports the following features.

Paragraph and text styles

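Paragraph alignment (left/right/center/distributed) and paragraph spacing
Horizontal (left to right) or vertical (bottom to top) text
Font styles (color, font name, size, bold/italic)
Text styles (highlight, underline, strikethrough, hyperlink)
List styles, however, are recognized poorly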

Images

Inline images embedded in paragraphs
Floating images placed behind the text
Gray/RGB/CMYK color modes and images with transparent backgrounds

Tables and their styles

Border styles (width, color)
Cell background color
Merged cells
Tables with partially hidden borders (for example three-line tables)
Nested tables

Multi-process parallel processing

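from pdf2docx import Converter

def file2Docx():
    global pdf_name
    cv = Converter(pdf_name)
    # Output folder named after the PDF, without the '.pdf' suffix
    total_file = pdf_name[:-4]
    mkdir(total_file)  # helper defined elsewhere in the program
    docx_name = total_file + '/PDF2Docx.docx'
    # start=0, end=None converts every page
    cv.convert(docx_name, start=0, end=None)
    cv.close()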
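As a rough standalone variant of the function above, the sketch below takes the PDF path as an argument instead of using a global, and optionally turns on the multi-process mode mentioned in the feature list. docx_output_path and pdf_to_docx are hypothetical names; the multi_processing keyword is the option described in the pdf2docx documentation as I understand it, so check it against the installed version.

```python
import os


def docx_output_path(pdf_name):
    """Return <pdf folder>/<pdf stem>/PDF2Docx.docx, as in file2Docx."""
    out_dir = os.path.splitext(pdf_name)[0]
    return os.path.join(out_dir, "PDF2Docx.docx")


def pdf_to_docx(pdf_name, use_multiprocessing=False):
    """Convert every page of pdf_name to a docx file and return its path."""
    # Imported here so the path helper above works without pdf2docx installed.
    from pdf2docx import Converter

    docx_name = docx_output_path(pdf_name)
    os.makedirs(os.path.dirname(docx_name), exist_ok=True)
    cv = Converter(pdf_name)
    try:
        # start=0, end=None covers all pages; multi_processing is assumed
        # to be accepted as a keyword option by Converter.convert.
        cv.convert(docx_name, start=0, end=None,
                   multi_processing=use_multiprocessing)
    finally:
        cv.close()  # release the PDF even if the conversion fails
    return docx_name
```

The try/finally is a design choice of this sketch, not of the original code: it makes sure the converter is closed when a page fails to parse.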

Finally, a simple GUI is drawn with tkinter. The basic functions work, but how long a conversion takes depends on the speed of the machine.
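The original GUI code is not shown in this post, so the following is only a guess at how such a window might be wired up with tkinter. FORMAT_CHOICES, suffix_for_choice and build_gui are all hypothetical names, and the mapping of radio-button values to suffixes is an assumption (0 meaning that nothing has been selected, as in the check in file2Pic).

```python
# Hypothetical mapping from radio-button value to file suffix.
# The value 0 is reserved for "nothing selected".
FORMAT_CHOICES = {1: ".jpg", 2: ".png", 3: ".bmp"}


def suffix_for_choice(choice):
    """Return the suffix for a radio-button value, or None if unselected."""
    return FORMAT_CHOICES.get(choice)


def build_gui(on_convert):
    """Build the window; on_convert(pdf_name, suffix) does the conversion."""
    import tkinter as tk
    from tkinter import filedialog, messagebox

    root = tk.Tk()
    root.title("PDF converter")
    choice = tk.IntVar(value=0)
    state = {"pdf_name": ""}

    def pick_file():
        state["pdf_name"] = filedialog.askopenfilename(
            filetypes=[("PDF files", "*.pdf")])

    def convert():
        suffix = suffix_for_choice(choice.get())
        if not state["pdf_name"]:
            messagebox.showwarning(message="Please select the file to export")
        elif suffix is None:
            messagebox.showwarning(message="Please select the export file type")
        else:
            on_convert(state["pdf_name"], suffix)
            status.config(text="Conversion complete")

    tk.Button(root, text="Choose PDF", command=pick_file).pack()
    for value, suffix in FORMAT_CHOICES.items():
        tk.Radiobutton(root, text=suffix, variable=choice, value=value).pack()
    tk.Button(root, text="Convert", command=convert).pack()
    status = tk.Label(root, text="")
    status.pack()
    return root
```

To try it, call build_gui(lambda pdf, suffix: print(pdf, suffix)).mainloop() and replace the lambda with a real conversion function.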

A Gitee repository has been set up for this project; feedback and contributions are welcome.


Origin blog.csdn.net/qq_44879321/article/details/124758488