PDF is a notoriously painful format to work with. There are many libraries for processing PDFs, but none of them is perfect.

One, pdfminer3k

pdfminer3k is the Python 3 version of pdfminer, used mainly for reading text from PDFs.

There are many pdfminer3k code examples online, and after reading them I just want to complain: they are too complex, which goes against Python's simplicity.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

path = "test.pdf"

# Create a PDF parser from a file object
parser = PDFParser(open(path, 'rb'))
# Create a PDF document object
doc = PDFDocument()
# Connect the parser and the document object to each other
parser.set_document(doc)
doc.set_parser(parser)

# Supply the password for initialization
# If no password is given, it defaults to an empty string
doc.initialize()

# Check whether the document allows text extraction; if not, give up
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
else:
    # Create a PDF resource manager to manage shared resources
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Loop over the pages, processing one page at a time
    for page in doc.get_pages():
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        # layout is an LTPage object holding the objects parsed from this page,
        # such as LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal and so on
        for x in layout:
            if isinstance(x, LTTextBox):
                print(x.get_text().strip())

pdfminer is very unfriendly when it comes to tables: we can extract the text, but not the layout:

Screenshot of the table in the PDF:

Result of running the code:

Restoring the table from this output is not easy, and piling on extra rules will inevitably make the approach less general.

Two, tabula-py

tabula is designed specifically for extracting table data from PDFs, and supports exporting to CSV and Excel formats. However, the tool is written in Java and depends on Java 7/8. tabula-py is a thin Python wrapper around it, so it also depends on Java 7/8.

The code is simple:

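import tabula

path = 'test.pdf'

# Assumes read_pdf returns a single DataFrame, as in older tabula-py releases;
# newer releases return a list of DataFrames instead
df = tabula.read_pdf(path, encoding='gbk', pages='all')
for index in df.index:
    print(df.loc[index].values)

# Export to CSV instead (needs "import os" if uncommented)
# tabula.convert_into(path, os.path.splitext(path)[0]+'.csv', pages='all')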

Although it is billed as a specialist tool for PDF tables, the actual results are not good. Using the same PDF as for pdfminer, the result is as follows:

This result is really embarrassing. First, the table header is recognized incorrectly. Second, the PDF contains two tables, and I did not find a way to tell them apart.

Three, pdfplumber

pdfplumber processes a PDF page by page: you can get all the text on a page, and it provides a separate method for extracting tables.

import pdfplumber

path = 'test.pdf'
pdf = pdfplumber.open(path)

for page in pdf.pages:
    # Get all the text on the current page, including the text inside tables
    # print(page.extract_text())

    for table in page.extract_tables():
        # print(table)
        for row in table:
            print(row)
        print('---------- separator ----------')

pdf.close()

Each table obtained is a two-dimensional array of strings; to compare with tabula, it is printed here row by row.
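
Because each table is just a list of rows, it can be passed to the standard library without any conversion. The following is a minimal sketch, not part of pdfplumber: `table_to_csv` and `sample_table` are hypothetical names, and the sample data is made up to have the same shape as one entry of `page.extract_tables()`, with `None` standing in for an empty cell.

```python
import csv
import io


def table_to_csv(table):
    """Serialize a list-of-rows table to CSV text; None cells become empty strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical sample shaped like one table from page.extract_tables()
sample_table = [
    ["Name", "Qty", "Note"],
    ["Widget", "3", None],
    ["Gadget", "10", "line one\nline two"],
]

csv_text = table_to_csv(sample_table)
print(csv_text)
```

Cells that contain line breaks are quoted by the csv module, so they survive a round trip through `csv.reader`.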

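Compared with tabula, the tables are first of all distinguishable from each other, and second, accuracy is much higher: the table headers are recognized completely correctly. Cells that contain line breaks are still not recognized quite correctly, but at least the columns are divided properly, so the output can still be cleaned up.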

import pdfplumber
import re

path = 'test1.pdf'
pdf = pdfplumber.open(path)

for page in pdf.pages:
    print(page.extract_text())
    for pdf_table in page.extract_tables():
        table = []
        cells = []
        for row in pdf_table:
            if not any(row):
                # A completely empty row marks the end of a record
                if any(cells):
                    table.append(cells)
                    cells = []
            elif all(row):
                # A completely filled row is a new record, so the previous one ends
                if any(cells):
                    table.append(cells)
                    cells = []
                table.append(row)
            else:
                # A partially filled row is merged cell by cell into the pending record
                if len(cells) == 0:
                    cells = list(row)
                else:
                    for i in range(len(row)):
                        if row[i] is not None:
                            cells[i] = row[i] if cells[i] is None else cells[i] + row[i]
        # Flush a record that is still pending when the table ends
        if any(cells):
            table.append(cells)
        for row in table:
            print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
        print('---------- separator ----------')

pdf.close()

After this processing, running the code gives the following result:
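
To try the merging rule without a PDF at hand, the same idea can be sketched as a standalone function. This is only an illustration under assumptions: `merge_wrapped_rows` and `sample_rows` are hypothetical names, the sample data is made up, and the sketch also keeps a record that is still pending when the table ends.

```python
import re


def merge_wrapped_rows(pdf_table):
    """Merge partially filled rows into single records and strip whitespace."""
    merged = []
    pending = []  # cells of the record currently being assembled
    for row in pdf_table:
        if not any(row):
            # Completely empty row: the pending record is finished
            if any(pending):
                merged.append(pending)
                pending = []
        elif all(row):
            # Completely filled row: finish the pending record, keep this row as is
            if any(pending):
                merged.append(pending)
                pending = []
            merged.append(list(row))
        elif not pending:
            # First fragment of a record that wraps over several rows
            pending = list(row)
        else:
            # Further fragment: append each non-empty cell to its column
            for i, cell in enumerate(row):
                if cell is not None:
                    pending[i] = cell if pending[i] is None else pending[i] + cell
    if any(pending):
        merged.append(pending)
    return [
        [re.sub(r"\s+", "", cell) if cell is not None else None for cell in row]
        for row in merged
    ]


# Made-up rows shaped like one table from page.extract_tables()
sample_rows = [
    ["Name", "Amount", "Remark"],
    ["Alpha", "1", None],
    [None, None, "n/a"],
    [None, None, None],
    ["Be\nta", "2", "ok"],
    ["Gam", "3", None],
    ["ma", None, None],
]

result = merge_wrapped_rows(sample_rows)
for record in result:
    print(record)
```

Note that the rule treats any completely filled row as a record of its own, so a wrapped fragment that happens to fill every column would be split off; like any hand-written rule, it trades generality for fitting a particular layout.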