[Repost] Python code that teaches you to batch-convert PDF to Word

Many documents are distributed as PDF, a format that is inconvenient for study and editing, so you often need to convert PDF files to Word. You may have downloaded plenty of conversion software from the Internet, only to find that it converts just the first five pages (WPS, for example) or charges a fee. Is there really no free conversion tool?

 

So here is a free, simple, and fast method: use Python to batch-process PDF files, extract the content you want, and save it in Word format.

 

Before implementing the PDF-to-Word function, we need a working Python environment with the related libraries installed. For the Python environment, we recommend PyCharm.

 

The PDF-to-Word function relies on the following pdfminer classes:

    • PDFParser (document analyzer)

    • PDFDocument (document object)

    • PDFResourceManager (resource manager)

    • PDFPageInterpreter (interpreter)

    • PDFPageAggregator (aggregator)

    • LAParams (layout analysis parameters)

Steps:

Install the pdfminer3k module (the code below also imports docx, which is provided by the python-docx package)
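
Installation is typically done with pip, for example: pip install pdfminer3k python-docx. Because the import names differ from the package names, a quick check like the sketch below can confirm that both libraries are visible to the interpreter before running the conversion script. The helper name missing_packages and the REQUIRED mapping are illustrative assumptions, not part of the original article.

```python
import importlib.util

# Import names differ from the PyPI package names:
# pdfminer3k is imported as "pdfminer", python-docx is imported as "docx".
REQUIRED = {"pdfminer": "pdfminer3k", "docx": "python-docx"}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing, install with: pip install " + " ".join(missing))
    else:
        print("All dependencies found.")
```

If the script reports a missing package inside PyCharm, check that the project interpreter is the same environment that pip installed into.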

 

Code:

 
#!/usr/bin/env python
# Version = 3.5.2
# __auth__ = '无名小妖'
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from docx import Document

document = Document()


def parse():
    # Open the local PDF file in binary read mode ('rb').
    # The file must stay open while the pages are being processed.
    with open('Django-日志配置.md.pdf', 'rb') as fn:
        # Create a PDF parser
        parser = PDFParser(fn)
        # Create a PDF document object
        doc = PDFDocument()
        # Connect the parser and the document object to each other
        parser.set_document(doc)
        doc.set_parser(parser)

        # Supply the password at initialization, e.g. doc.initialize("lianxipython")
        # If there is no password, pass an empty string
        doc.initialize("")
        # Check whether the document allows text extraction; raise if it does not
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed

        # Create the PDF resource manager
        resource = PDFResourceManager()
        # Create the layout analysis parameters
        laparams = LAParams()
        # Create the aggregator, which collects the layout objects of each page
        device = PDFPageAggregator(resource, laparams=laparams)
        # Create the interpreter, which processes page content and feeds the aggregator
        interpreter = PDFPageInterpreter(resource, device)
        # doc.get_pages() yields the pages; handle one page per iteration
        for page in doc.get_pages():
            # Parse a single page with the interpreter's process_page() method
            interpreter.process_page(page)
            # Fetch the result with the aggregator's get_result() method
            layout = device.get_result()
            # layout is an LTPage object holding the objects parsed from this page
            for out in layout:
                # Only objects with a get_text() method carry the text we want
                if hasattr(out, "get_text"):
                    # print(out.get_text(), type(out.get_text()))
                    # Replace '\xa0' (the non-breaking space, &nbsp;) with a plain space
                    content = out.get_text().replace('\xa0', ' ')
                    # with open('test.txt', 'a') as f:
                    #     f.write(out.get_text().replace('\xa0', ' ') + '\n')
                    # Add a paragraph styled as an unordered (bulleted) list item.
                    # 'List Bullet' is the style name; looking it up by the
                    # style id 'ListBullet' is deprecated in python-docx.
                    document.add_paragraph(content, style='List Bullet')

    # Save the Word document once, after all pages have been processed
    document.save('demo1.docx')


if __name__ == '__main__':
    parse()
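
The script above converts one hard-coded file. To process a whole folder, as the title suggests, one possible approach is a small driver that walks a directory and calls a conversion function for each PDF. The sketch below is an assumption layered on top of the article's code: batch_convert and its convert_one argument are hypothetical names, and convert_one stands for a variant of parse() that accepts the input and output paths and creates a fresh Document() for each file.

```python
from pathlib import Path


def batch_convert(src_dir, dst_dir, convert_one):
    """Call convert_one(pdf_path, docx_path) for every PDF in src_dir.

    convert_one is a hypothetical callable, for example a version of
    parse() that takes the input and output paths as arguments.
    Returns two lists of file names: (converted, failed).
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    # Note: "*.pdf" is matched case-sensitively on case-sensitive file systems.
    for pdf_path in sorted(src.glob("*.pdf")):
        docx_path = dst / (pdf_path.stem + ".docx")
        try:
            convert_one(pdf_path, docx_path)
            converted.append(pdf_path.name)
        except Exception as exc:
            # Keep going if one file is damaged or does not allow extraction
            failed.append(pdf_path.name)
            print(f"Skipped {pdf_path.name}: {exc}")
    return converted, failed
```

Passing the converter in as an argument keeps the directory handling separate from the pdfminer logic, so a protected or damaged PDF is reported and skipped instead of stopping the whole batch.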

  

 


---------------------
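Author: 无名小妖
Source: CNBLOGS
Original: https://www.cnblogs.com/wumingxiaoyao/p/8460973.html
Copyright notice: This is an original article by the author; please include a link to the post when reposting.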


Origin www.cnblogs.com/vilogy/p/12333925.html