Convert PDF to Word using Python

1. Install pdfminer

PDFMiner is a tool for extracting information from PDF documents. pdfminer3k is a Python 3 port of pdfminer.

pip install pdfminer3k
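
A quick way to confirm the install worked is to check that the package can be found by Python. The sketch below assumes pdfminer3k installs its code under the module name pdfminer, which is the name the imports in step 2 use. is_installed is a hypothetical helper for illustration, not part of pdfminer.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # 'pdfminer' is the import name used by the code in step 2.
    if is_installed("pdfminer"):
        print("pdfminer is available")
    else:
        print("pdfminer is missing; run: pip install pdfminer3k")
```

The same check works for the docx module once python-docx is installed in step 3.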

2. Read the content of the PDF file

import warnings
from io import StringIO
from urllib.request import urlopen

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf

# Suppress library warnings so they do not clutter the output.
warnings.filterwarnings("ignore")

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content

def save_to_file(file_name, contents):
    # Write the extracted text to disk as UTF-8.
    with open(file_name, 'w', encoding='utf-8') as fh:
        fh.write(contents)


def main():
    # Download the sample PDF and extract its text.
    pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
    outputString = readPDF(pdfFile)
    pdfFile.close()
    # The output is plain text, so save it with a .txt extension.
    save_to_file('c.txt', outputString)


if __name__ == '__main__':
    main()
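
The string returned by readPDF is one long block of text. As far as I know, pdfminer's TextConverter writes a form feed character ('\f') at the end of each page, so the text can be split back into pages and paragraphs before it is saved. The following is a minimal sketch under that assumption. split_pages and split_paragraphs are hypothetical helpers, not part of pdfminer, and the sample string is made up for the demonstration.

```python
def split_pages(text):
    """Split extracted text into pages on the form feed character."""
    pages = [page.strip() for page in text.split("\f")]
    # Drop the empty entries left by a trailing form feed or blank pages.
    return [page for page in pages if page]


def split_paragraphs(page_text):
    """Split one page into paragraphs separated by blank lines."""
    paragraphs = []
    current = []
    for line in page_text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line ends the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    sample = "Title\n\nFirst line\nsecond line\n\f\nPage two\n\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, split_paragraphs(page))
```

Joining wrapped lines with a space suits running prose. For tables or code it would be better to keep the original line breaks.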

3. Install Python DocX

python-docx is part of the python-openxml project. It can open Word 2007 and later documents, and the files it saves can be used in Microsoft Office 2007/2010, Microsoft Office 2008 for Mac, Google Docs, OpenOffice.org 3, and Apple iWork '08.

pip install python-docx

Installation often fails with an error like this:
ERROR: Exception:
Traceback (most recent call last):
  File "c:\users\l\appdata\local\programs\python\python37\lib\site-packages\pip\_vendor\resolvelib\resolvers.py", line 171, in _merge_into_criterion
    crit = self.state.criteria[name]
KeyError: 'python-docx'
During handling of the above exception, another exception occurred:

Solution:

Download the python-docx source package directly from PyPI:

https://pypi.org/project/python-docx/#files

Then install it from the downloaded file:

pip install ./downloads/python-docx-0.8.10.tar.gz

Here ./downloads/python-docx-0.8.10.tar.gz stands for the actual path of the downloaded python-docx-0.8.10.tar.gz file.

For example, if the file was saved to the root of the C: drive, run:

pip install C:\python-docx-0.8.10.tar.gz

python-docx 0.8.10 requires lxml>=2.3.2, so if the installed lxml is older than that, upgrade lxml as well.
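
To see whether the installed lxml satisfies that requirement, the version can be read from the package metadata. This is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata) and a plain dotted version string. version_tuple and meets_minimum are hypothetical helpers written for this example.

```python
from importlib import metadata


def version_tuple(version):
    """Turn a dotted version such as '4.9.1' into a tuple of ints.

    Parsing stops at the first component that is not purely numeric,
    so '5.0.0rc1' becomes (5, 0).
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(package, minimum):
    """Return True/False for an installed package, or None if it is missing."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    result = meets_minimum("lxml", "2.3.2")
    if result is None:
        print("lxml is not installed")
    elif result:
        print("lxml is new enough")
    else:
        print("lxml is too old; run: pip install --upgrade lxml")
```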

4. Use DocX to save a Word document

import warnings

from docx import Document
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser, PDFDocument

warnings.filterwarnings("ignore")

file_name = 'a.pdf'  # path of the PDF to convert


def main():
    document = Document()
    with open(file_name, 'rb') as fn:
        # Connect the parser and the document object to each other.
        parser = PDFParser(fn)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        # Initialize the document with the default (empty) password.
        doc.initialize()
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed('Text extraction is not allowed')
        resource = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(resource, laparams=laparams)
        interpreter = PDFPageInterpreter(resource, device)
        for page in doc.get_pages():
            interpreter.process_page(page)
            layout = device.get_result()
            for out in layout:
                # Only text objects expose get_text(); skip everything else.
                if hasattr(out, "get_text"):
                    # Replace non-breaking spaces with ordinary spaces.
                    content = out.get_text().replace('\xa0', ' ')
                    document.add_paragraph(content, style='List Bullet')
    # Save once, after every page has been processed.
    document.save('a.docx')
    print('Done')


if __name__ == '__main__':
    main()
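
Text pulled from a PDF can contain characters that a .docx file cannot store. python-docx builds on lxml, which raises ValueError for strings that contain NULL bytes or control characters, and the text boxes pdfminer returns are usually hard-wrapped at the line ends of the PDF. The sketch below shows one way to tidy a text box before it is passed to add_paragraph. clean_block is a hypothetical helper, and collapsing wrapped lines into one line is an assumption that suits running prose but not tables or code.

```python
import re

# Control characters that XML 1.0 does not allow (tab, LF and CR are allowed).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_block(text):
    """Normalize one extracted text box into a single-line paragraph."""
    # Replace non-breaking spaces, as the step 4 code does.
    text = text.replace("\xa0", " ")
    # Remove control characters such as form feeds and NULL bytes.
    text = _CONTROL_CHARS.sub("", text)
    # Collapse newlines, tabs and runs of spaces into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "First line of a\nwrapped\xa0paragraph.\n\x0c"
    print(repr(clean_block(raw)))
```

In the loop from step 4, the call would replace the plain .replace() line, for example content = clean_block(out.get_text()).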


Origin blog.csdn.net/weixin_47542175/article/details/113856670