Python: Parsing Text and Tables in PDFs with pdfminer, tabula, and pdfplumber, and How They Compare

PDF is a really nasty format to work with. There are many libraries for handling PDFs, but none of them is perfect.

One, pdfminer3k

pdfminer3k is the Python 3 port of pdfminer, mainly used for reading text from PDFs.

There are many pdfminer3k code examples online. After reading them, I just have to complain: they are too complex, which goes against Python's simplicity.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

path = "test.pdf"

# Create a PDF parser from a file object
parser = PDFParser(open(path, 'rb'))
# Create a PDF document object
doc = PDFDocument()
# Connect the parser and the document object to each other
parser.set_document(doc)
doc.set_parser(parser)

# Supply the initial password
# If the document has no password, it defaults to an empty string
doc.initialize()

# Check whether the document allows text extraction; if not, raise an error
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
else:
    # Create a PDF resource manager to manage shared resources
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Loop over the pages, processing one page at a time
    for page in doc.get_pages():
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        # layout is an LTPage object that holds the objects parsed from this page,
        # such as LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal and so on
        for x in layout:
            if isinstance(x, LTTextBox):
                print(x.get_text().strip())

 

pdfminer is very unfriendly to tables: we can extract the text, but not the table structure.

Screenshot of the PDF table:

Result of running the code:

 

Restoring the table from this output is not easy, and piling on rules to do it would inevitably make the code less general.
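To give a sense of what such rules look like, here is a minimal sketch that groups text boxes into rows by their vertical position. It assumes each box has already been reduced to a hypothetical `(x0, y0, text)` tuple, taken from a layout object's `bbox` and `get_text()`. The `group_rows` name and the `tolerance` value are illustrative and not part of pdfminer.

```python
def group_rows(boxes, tolerance=3.0):
    """Group (x0, y0, text) tuples into rows of text.

    A box joins the current row when its y0 is within `tolerance` points of
    the first box in that row. PDF coordinates grow upwards, so sorting y0
    in descending order yields rows from top to bottom.
    """
    rows = []
    for x0, y0, text in sorted(boxes, key=lambda b: -b[1]):
        if rows and abs(rows[-1]["y"] - y0) <= tolerance:
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"y": y0, "cells": [(x0, text)]})
    # Within each row, order the cells from left to right
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Hypothetical boxes, as if collected from LTTextBox objects on one page
sample = [
    (50.0, 700.2, "Name"), (200.0, 700.0, "Price"),
    (200.0, 680.1, "3.50"), (50.0, 680.0, "Apple"),
]
rows = group_rows(sample)
print(rows)
```

A rule like this breaks as soon as a cell wraps onto a second line or two tables sit side by side, which is the loss of generality described above.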

 

Two, tabula-py

tabula is designed specifically for extracting table data from PDFs, and it supports exporting to CSV and Excel formats. However, the tool is written in Java and depends on Java 7/8. tabula-py is a thin Python wrapper around it, so it depends on Java 7/8 as well.
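Because of the Java dependency, it can help to check that a `java` executable is reachable before calling tabula-py. A minimal sketch using only the standard library; the `find_java` helper name is illustrative, and finding `java` on the PATH does not by itself guarantee a compatible version.

```python
import shutil


def find_java():
    """Return the full path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("java not found on PATH; tabula-py will not be able to run")
else:
    print("java found at", java_path)
```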

The code is simple:

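import os

import tabula

path = 'test.pdf'

# Assumes read_pdf returns a single DataFrame;
# newer tabula-py releases may return a list of DataFrames instead
df = tabula.read_pdf(path, encoding='gbk', pages='all')
for index in df.index:
    print(df.loc[index].values)

# Alternatively, export the tables straight to CSV:
# tabula.convert_into(path, os.path.splitext(path)[0]+'.csv', pages='all')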

Although it claims to specialize in PDF tables, the actual results are not that great. Using the same PDF as in the pdfminer example, the result is as follows:

This result is really rather embarrassing. To begin with, the table header is recognized incorrectly. The PDF also contains two tables, and I did not find a way to tell them apart.
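On telling tables apart: tabula-py's `read_pdf` accepts a `multiple_tables` option that asks for one DataFrame per detected table. The following is a minimal sketch, assuming a tabula-py release that supports this option and a working Java install. The `read_tables` and `as_table_list` helper names are illustrative, and because the shape of the return value differs between releases, the result is normalized to a list.

```python
def as_table_list(result):
    """Normalize read_pdf output to a list, whether it was one table or many."""
    if result is None:
        return []
    if isinstance(result, (list, tuple)):
        return list(result)
    return [result]


def read_tables(path):
    """Read each table in a PDF as a separate DataFrame (needs tabula-py and Java)."""
    import tabula  # imported here so as_table_list works without tabula installed

    result = tabula.read_pdf(path, pages="all", multiple_tables=True)
    return as_table_list(result)


# The normalizer on stand-in values (strings in place of DataFrames)
print(as_table_list("table"))
print(as_table_list(["t1", "t2"]))
```

Whether the detected tables match the ones visible on the page still depends on the PDF, so the output needs checking case by case.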

 

Three, pdfplumber

pdfplumber processes a PDF page by page. It can retrieve all the text on a page, and it provides a separate method for extracting tables.

import pdfplumber

path = 'test.pdf'
pdf = pdfplumber.open(path)

for page in pdf.pages:
    # Get all the text on the current page, including the text inside tables
    # print(page.extract_text())                        

    for table in page.extract_tables():
        # print(table)
        for row in table:
            print(row)
        print('---------- separator ----------')

pdf.close()

The resulting table is a two-dimensional array of strings. To make the comparison with tabula easier, it is printed row by row here.
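Since each table is just a list of rows, it can be passed to the standard `csv` module, which gives roughly the CSV export that tabula offers. A minimal sketch: the `table_to_csv` helper and the sample data are hypothetical, and empty cells are assumed to arrive as `None`.

```python
import csv
import io


def table_to_csv(table):
    """Serialize a list of rows to CSV text, writing None cells as empty strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical table in the shape described above: rows of str or None
sample_table = [["Name", "Price"], ["Apple", "3.50"], ["Pear", None]]
csv_text = table_to_csv(sample_table)
print(csv_text)
```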

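As you can see, compared with tabula, pdfplumber can first of all tell the tables apart, and its accuracy is also much higher: the header is recognized completely correctly. Cells that contain line breaks are still not recognized quite right, but at least the columns are divided correctly, so the output can still be cleaned up.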

import pdfplumber
import re

path = 'test1.pdf'
pdf = pdfplumber.open(path)

for page in pdf.pages:
    print(page.extract_text())
    for pdf_table in page.extract_tables():
        table = []
        cells = []
        for row in pdf_table:
            if not any(row):
                # A completely empty row marks the end of a record
                if any(cells):
                    table.append(cells)
                    cells = []
            elif all(row):
                # A completely filled row is a new record, so the previous one ends
                if any(cells):
                    table.append(cells)
                    cells = []
                table.append(row)
            else:
                # A partially filled row is one line of a wrapped record
                if len(cells) == 0:
                    cells = list(row)
                else:
                    for i in range(len(row)):
                        if row[i] is not None:
                            cells[i] = row[i] if cells[i] is None else cells[i] + row[i]
        # Keep a record that is still pending when the table ends
        if any(cells):
            table.append(cells)
        for row in table:
            print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
        print('---------- separator ----------')

pdf.close()

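After this processing, running the code gives:

This result is now completely correct, whereas with tabula you cannot get such a result even with post-processing. Of course, different PDFs may need different handling, so you still have to analyze your own case.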
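The merging rules in the loop above can be tried out without a PDF by moving them into a function and feeding it sample rows. This is a minimal sketch: `merge_wrapped_rows` is an illustrative name, the sample rows are made up, and a record still pending at the end of the table is kept.

```python
def merge_wrapped_rows(rows):
    """Merge rows that were split because a cell wrapped onto several lines.

    An all-empty row ends the pending record, a fully filled row is a complete
    record on its own, and a partially filled row is concatenated column by
    column into the pending record.
    """
    merged, pending = [], []
    for row in rows:
        if not any(row):
            if any(pending):
                merged.append(pending)
            pending = []
        elif all(row):
            if any(pending):
                merged.append(pending)
            pending = []
            merged.append(list(row))
        elif not pending:
            pending = list(row)
        else:
            for i, cell in enumerate(row):
                if cell is not None:
                    pending[i] = cell if pending[i] is None else pending[i] + cell
    # Keep a record that is still pending when the table ends
    if any(pending):
        merged.append(pending)
    return merged


# Hypothetical rows: the record with ID 2 is split across two rows
sample = [
    ["ID", "Name", "Note"],
    ["1", "Alpha", "ok"],
    ["2", "Be", None],
    [None, "ta", "long"],
    [None, None, None],
]
result = merge_wrapped_rows(sample)
for row in result:
    print(row)
```

These rules assume that every complete record fills all of its columns. A table with genuinely empty cells would need a different signal for where a record starts.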

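pdfplumber is also inaccurate at times, mainly in the form of missing columns:

I found another PDF; a screenshot of its table section is below:

The parsing result is as follows:

Four columns became two. The same problem also occurs when a table has merged cells. I picked this table because it is a special case: it lost columns even though it has no merged cells. This is probably related to how the PDF was generated.

In fact the data is retrieved completely and nothing is lost; it is just treated as non-table content. The output of page.extract_text() is as follows: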

 

Then I tried tabula on it as well, with the following result:

The columns are all there, but where is the header???
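Because page.extract_text() still returned every cell of that table, one fallback is to rebuild the rows from the text itself. This is a minimal sketch under strong assumptions: cells contain no internal whitespace, and the number of columns is known in advance. The `rows_from_text` helper and the sample text are hypothetical.

```python
def rows_from_text(text, expected_columns):
    """Split each line on whitespace and keep lines with the expected cell count.

    Returns (rows, leftovers): leftovers are the non-empty lines that did not
    split into exactly `expected_columns` cells and need manual handling.
    """
    rows, leftovers = [], []
    for line in text.splitlines():
        cells = line.split()
        if len(cells) == expected_columns:
            rows.append(cells)
        elif cells:
            leftovers.append(line)
    return rows, leftovers


# Hypothetical extract_text() output for a four-column table
sample_text = (
    "Code Name Unit Price\n"
    "A01 Bolt pcs 0.20\n"
    "Table 3 (continued)\n"
    "A02 Nut pcs 0.05\n"
)
rows, leftovers = rows_from_text(sample_text, expected_columns=4)
print(rows)
print(leftovers)
```

Any cell that contains a space ends up in `leftovers`, so this only suits tables with simple cell contents.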

 

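pdfplumber also provides a visual debugging feature: it can render an image of a PDF page and draw boxes around the text or tables it recognized, which helps you judge how well the PDF is being parsed and adjust the configuration. Using this feature also requires installing ImageMagick. Since I have not needed it, I have not looked into it in detail yet.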
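A minimal sketch of that debugging workflow, based on pdfplumber's `to_image()` and `debug_tablefinder()` methods. The `save_debug_image` helper, the resolution, and the `TEXT_SETTINGS` dictionary are illustrative choices. Rendering requires pdfplumber plus the image backend mentioned above, so the import sits inside the function.

```python
# Illustrative table settings: infer cell boundaries from text alignment
# instead of from ruling lines drawn in the PDF
TEXT_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def save_debug_image(pdf_path, page_index, out_path, table_settings=None):
    """Render one page with the table finder's detected lines and cells drawn on it."""
    import pdfplumber  # imported here so the module loads without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        image = page.to_image(resolution=150)
        image.debug_tablefinder(table_settings or {})
        image.save(out_path, format="PNG")


# Example call (needs a real PDF, so it is left commented out):
# save_debug_image("test.pdf", 0, "page0_debug.png", TEXT_SETTINGS)
```

The same settings dictionary can be passed to `page.extract_tables()`, so one way to approach the missing-column case is to compare the debug images for the default and the text-based strategies before changing the extraction code.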

 

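Four, Afterword

When writing crawlers, we inevitably run into PDFs that need parsing, mainly to extract text and table data. Python has a great many libraries for handling PDFs. PyPDF2 is another example, and there is plenty of material about it online, but when I tried it the text came out garbled. I did not read the source code carefully, so I never solved that problem.

After comparing the three more commonly used libraries, I find pdfplumber the easiest to use, with the best support for tables.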

 


Related post:

Python: A Brief Guide to Reading .doc and .docx Word Files, and the "Word failed to raise an event" Error

ref:https://www.cnblogs.com/gl1573/p/10064438.html

Origin www.cnblogs.com/wind-chaser/p/11264063.html