Practical code for processing PDFs in Python: PyPDF2, PDFMiner, pdfplumber

If you don't know how to arrange your life, plenty of people will arrange it for you around the things they need you to do.

We often use PDF files, especially in these two scenarios:

  • Downloading reference materials, such as reports and documents

  • Sharing read-only information, which is easy to distribute while you keep the source files

Scenarios and modules

Accordingly, there are two common kinds of requirements for PDF files:

  1. Processing the file itself: page-level operations such as merging or splitting pages, encryption/decryption, and watermarking.

  2. Processing the file content: content-level operations such as extracting text, table data, and charts.

Currently, there are three main Python modules for processing PDF:

  • PyPDF2: a mature module (last updated two years ago at the time of writing), well suited to page-level operations but poor at text extraction.

  • PDFMiner: good at text extraction. Its original main branch is no longer maintained and has been replaced by pdfminer.six.

  • pdfplumber: a content extraction tool built on pdfminer.six with a lower barrier to entry; for example, it supports table extraction.

In practice, choose the module by the type of requirement: for page-level operations, use PyPDF2; for content extraction, try pdfplumber first.

Install the modules with:

  pip install pypdf2

  pip install pdfminer.six

  pip install pdfplumber



The sections below demonstrate the three modules by usage scenario.

PyPDF2

The main capabilities of PyPDF2 operate at the page level, such as:

  • Get basic information of PDF documents

  • PDF split and merge

  • PDF rotation and sorting

  • PDF add watermark and remove watermark

  • PDF encryption and decryption

The two core classes of PyPDF2 are PdfFileReader and PdfFileWriter, which handle reading and writing PDF files respectively.
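
PdfFileReader and PdfFileWriter are the PyPDF2 1.x names; later releases renamed them PdfReader and PdfWriter and moved to snake_case methods. The following is a minimal sketch, under that assumption, of a page copy that works with either naming. The helper names uses_new_api and copy_pages are hypothetical, and PyPDF2 is imported inside the function so the block loads even where the package is not installed.

```python
def uses_new_api(module):
    """True if the module exposes the newer PdfReader/PdfWriter names."""
    return hasattr(module, "PdfReader") and hasattr(module, "PdfWriter")


def copy_pages(src_path, dst_path):
    """Copy every page of src_path into dst_path (hypothetical helper)."""
    import PyPDF2  # lazy import: only needed when the function is called

    with open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        if uses_new_api(PyPDF2):
            # Newer naming: PdfReader / PdfWriter with snake_case methods
            reader = PyPDF2.PdfReader(f_in)
            writer = PyPDF2.PdfWriter()
            for page in reader.pages:
                writer.add_page(page)
        else:
            # 1.x naming, as used in the rest of this article
            reader = PyPDF2.PdfFileReader(f_in)
            writer = PyPDF2.PdfFileWriter()
            for i in range(reader.getNumPages()):
                writer.addPage(reader.getPage(i))
        writer.write(f_out)
```

The examples in this article keep the 1.x names; if an import of PdfFileReader fails on your installation, the renamed classes are the likely cause.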

Get basic information of PDF documents

import pathlib
from PyPDF2 import PdfFileReader
# The sample data lives two directory levels above the working directory
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
with open(f_path, 'rb') as f:
    pdf = PdfFileReader(f)
    info = pdf.getDocumentInfo()       # document metadata
    cnt_page = pdf.getNumPages()       # total number of pages
    is_encrypt = pdf.getIsEncrypted()  # whether the file is encrypted
print(f'''
Author: {info.author}
Creator: {info.creator}
Producer: {info.producer}
Subject: {info.subject}
Title: {info.title}
Pages: {cnt_page}
Encrypted: {is_encrypt}
''')


PDF split and merge

import pathlib
from PyPDF2 import PdfFileReader, PdfFileWriter
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path = path.joinpath('002pdf_split_merge.pdf')
out_path_1 = path.joinpath('002pdf_split_half_front.pdf')
out_path_2 = path.joinpath('002pdf_split_half_back.pdf')
# Split the file into two halves
with open(f_path, 'rb') as f, open(out_path_1, 'wb') as f_out1, open(out_path_2, 'wb') as f_out2:
    pdf = PdfFileReader(f)
    pdf_out1 = PdfFileWriter()
    pdf_out2 = PdfFileWriter()
    cnt_pages = pdf.getNumPages()
    print(f'{cnt_pages} pages in total')
    for i in range(cnt_pages):
        # Indexes are 0-based, so "<" sends exactly cnt_pages // 2 pages to the front half
        if i < cnt_pages // 2:
            pdf_out1.addPage(pdf.getPage(i))
        else:
            pdf_out2.addPage(pdf.getPage(i))
    pdf_out1.write(f_out1)
    pdf_out2.write(f_out2)
# Merge the two halves again, with the back half first
with open(out_path, 'wb') as f_out:
    cnt_f, cnt_b = pdf_out1.getNumPages(), pdf_out2.getNumPages()
    pdf_out = PdfFileWriter()
    for i in range(cnt_b):
        pdf_out.addPage(pdf_out2.getPage(i))
    for i in range(cnt_f):
        pdf_out.addPage(pdf_out1.getPage(i))
    pdf_out.write(f_out)
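
The split boundary is easiest to check on bare page indexes, before any PDF objects are involved. Below is a minimal sketch with hypothetical helper names (split_halves, back_first_order), assuming 0-based indexes and that an odd page count puts the extra page in the back half.

```python
def split_halves(cnt_pages):
    """Return (front, back) lists of 0-based page indexes."""
    middle = cnt_pages // 2
    front = list(range(middle))
    back = list(range(middle, cnt_pages))
    return front, back


def back_first_order(cnt_pages):
    """Page order after merging the back half ahead of the front half."""
    front, back = split_halves(cnt_pages)
    return back + front


print(split_halves(5))      # ([0, 1], [2, 3, 4])
print(back_first_order(5))  # [2, 3, 4, 0, 1]
```

Each index can then be passed to getPage(i) and the result to addPage, as in the code above.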


PDF rotation and sorting

import pathlib
from PyPDF2 import PdfFileReader, PdfFileWriter
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path = path.joinpath('002pdf_rotate.pdf')
with open(f_path, 'rb') as f, open(out_path, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_out = PdfFileWriter()
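    # Rotate the first page 90 degrees clockwise
    page = pdf.getPage(0).rotateClockwise(90)
    pdf_out.addPage(page)
    # Reorder: put the third page (index 2) ahead of the second page
    pdf_out.addPage(pdf.getPage(2))
    # Rotate the second page 90 degrees counterclockwise
    page = pdf.getPage(1).rotateCounterClockwise(90)
    pdf_out.addPage(page)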
    pdf_out.write(f_out)
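
Rotation and reordering can be described as a list of (page index, angle) pairs and validated before touching the PDF. This is a minimal sketch with hypothetical helper names (normalize_rotation, build_plan); it assumes page rotation must be a multiple of 90 degrees and treats negative angles as counterclockwise.

```python
def normalize_rotation(degrees):
    """Map any multiple of 90 to an equivalent clockwise angle in {0, 90, 180, 270}."""
    if degrees % 90 != 0:
        raise ValueError("page rotation must be a multiple of 90 degrees")
    return degrees % 360


def build_plan(plan, cnt_pages):
    """Validate (page_index, degrees) pairs and normalize the angles."""
    result = []
    for index, degrees in plan:
        if not 0 <= index < cnt_pages:
            raise IndexError(f"page index {index} out of range")
        result.append((index, normalize_rotation(degrees)))
    return result


# Same layout as the example above: first page clockwise, then the third page,
# then the second page counterclockwise
print(build_plan([(0, 90), (2, 0), (1, -90)], cnt_pages=3))
```

Each pair could then be applied with pdf.getPage(index).rotateClockwise(angle) followed by addPage.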


PDF add watermark and remove watermark

Adding an image watermark really means overlaying content with a transparent background onto each page, which the page's mergePage method can do. In the code below, the watermark itself is stored as a page of a separate PDF file.

import pathlib
from PyPDF2 import PdfFileReader, PdfFileWriter
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
wm_path = path.joinpath('watermark.pdf')
en_path = path.joinpath('002pdf_with_watermark_en.pdf')  # not used below
out_path = path.joinpath('002pdf_with_watermark.pdf')
with open(f_path, 'rb') as f, open(wm_path, 'rb') as f_wm, open(out_path, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_wm = PdfFileReader(f_wm)
    pdf_out = PdfFileWriter()
    # Going by the variable names, page 0 of watermark.pdf holds the Chinese
    # watermark and page 1 the English one (the latter is not used below)
    wm_cn_page = pdf_wm.getPage(0)
    wm_en_page = pdf_wm.getPage(1)
    cnt_pages = pdf.getNumPages()
    for i in range(cnt_pages):
        page = pdf.getPage(i)
        page.mergePage(wm_cn_page)  # overlay the watermark on this page
        pdf_out.addPage(page)
    pdf_out.write(f_out)


Removing a watermark is more complicated and has to be analyzed case by case. Because a watermark may be text, an image, or a combination of both, the key is to identify its distinguishing features.

Three common approaches to watermark removal:

  • Find a characteristic word and replace it. This suits English documents but not CJK text such as Chinese.

  • Convert the PDF pages to images and remove the watermark with an image-processing algorithm. This destroys the original information structure of the file.

  • Find and delete all elements that match the watermark's size and position. This is the recommended approach.

The third approach works best, but complicated watermarks will test your patience.

You have to identify the drawing operations one by one and check the result after each replacement until the watermark is gone.

Even then, the same feature pattern may not clear every page, because a PDF may have been watermarked by several people using several different methods.
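
One way to look for the size and position features mentioned above is to check which elements repeat at the same place on most pages. The following is a minimal sketch that only identifies candidates and deletes nothing. It assumes each element is a dict with x0, top, width and height keys (similar to the object dicts pdfplumber exposes); the helper names and the 0.8 share threshold are hypothetical choices.

```python
from collections import Counter


def element_key(obj, tolerance=1.0):
    """Reduce an element to a rounded (x0, top, width, height) signature."""
    return tuple(round(obj[k] / tolerance) for k in ("x0", "top", "width", "height"))


def watermark_candidates(pages, min_share=0.8):
    """Return signatures that appear on at least min_share of the pages."""
    counts = Counter()
    for objs in pages:
        # A set counts each signature at most once per page
        counts.update({element_key(o) for o in objs})
    threshold = min_share * len(pages)
    return {key for key, n in counts.items() if n >= threshold}


# Made-up sample data: one element repeats on all three pages
pages = [
    [{"x0": 100.2, "top": 300.1, "width": 200.0, "height": 50.0},
     {"x0": 72.0, "top": 90.0, "width": 120.0, "height": 14.0}],
    [{"x0": 100.0, "top": 300.0, "width": 200.0, "height": 50.0},
     {"x0": 72.0, "top": 140.0, "width": 300.0, "height": 14.0}],
    [{"x0": 99.9, "top": 299.8, "width": 200.1, "height": 50.0}],
]
print(watermark_candidates(pages))  # {(100, 300, 200, 50)}
```

A document watermarked in several different ways would produce several signatures, or none that clears the threshold, which matches the caveat above.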

Therefore, no watermark removal method is 100% safe, effective (never deleting real content by mistake), and universal.

Adding and removing watermarks is essentially a contest between offense and defense.

For example, once a tool publishes a watermark removal feature, those who add watermarks can study its method and work around it.

Finally, everyone should respect copyright.

Outside of learning exercises, follow the content creator's rules in any real use.

PDF encryption and decryption

PDF passwords come in two kinds: the user password, which is needed to open the file, and the owner password, which controls permissions such as editing.

The encryption PyPDF2 provides is basic: it keeps honest people honest but will not stop a determined attacker.

For example, if someone opens the PDF and copies its content into a new file, the new file is not restricted by the owner password and can be modified.

import pathlib
from PyPDF2 import PdfFileReader, PdfFileWriter
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path_encrypt = path.joinpath('002pdf_encrypt.pdf')
out_path_decrypt = path.joinpath('002pdf_decrypt.pdf')
with open(f_path, 'rb') as f, open(out_path_encrypt, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_out = PdfFileWriter()
    cnt_pages = pdf.getNumPages()
    for i in range(cnt_pages):
        page = pdf.getPage(i)
        pdf_out.addPage(page)
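    # The first argument is the user password; owner_pwd sets the owner password
    pdf_out.encrypt('123456', owner_pwd='654321')
    pdf_out.write(f_out)
# Read the encrypted file back and write a decrypted copy
with open(out_path_encrypt, 'rb') as f, open(out_path_decrypt, 'wb') as f_out:
    pdf = PdfFileReader(f)
    if not pdf.isEncrypted:
        print('The file is not encrypted')
    else:
        success = pdf.decrypt('123456')
        if not success:
            print('Decryption failed: wrong password')
        else:
            pdf_out = PdfFileWriter()
            pdf_out.appendPagesFromReader(pdf)
            pdf_out.write(f_out)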
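
The value returned by decrypt is an integer rather than a plain boolean: 0 means the password failed, 1 means it matched the user password, and 2 means it matched the owner password. A minimal sketch of turning that value into a message follows; the helper name describe_decrypt_result is hypothetical.

```python
def describe_decrypt_result(result):
    """Translate the integer returned by decrypt() into a readable message."""
    messages = {
        0: "password did not match",
        1: "matched the user password",
        2: "matched the owner password",
    }
    return messages.get(int(result), "unexpected result")


print(describe_decrypt_result(1))  # matched the user password
```

In the example above, '123456' was set as the user password, so a successful call would be expected to return 1.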


pdfminer.six

PDFMiner has a relatively steep learning curve: you need some understanding of the PDF document structure model. It is best suited to custom development of tools for complex content processing.

Using PDFMiner directly is relatively rare, so only basic content operations are demonstrated here:

import pathlib
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams, LTTextBox, LTFigure, LTImage
from pdfminer.converter import PDFPageAggregator
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
with open(f_path, 'rb') as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        for x in layout:
            # Text boxes
            if isinstance(x, LTTextBox):
                print(x.get_text().strip())
            # Image objects
            if isinstance(x, LTImage):
                print('Found an image')
            # Figure objects
            if isinstance(x, LTFigure):
                print('Found a figure object')
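
In pdfminer.six layouts, LTImage objects normally sit inside LTFigure containers rather than at the top level of the page, so a flat loop over the layout can miss them. The following is a minimal sketch of a depth-first walk over nested containers. The helper name iter_layout and its leaf_types parameter are hypothetical, and the demonstration uses plain lists and tuples so it runs without pdfminer installed.

```python
def iter_layout(obj, leaf_types=()):
    """Yield obj and everything nested inside it, depth first.

    Objects whose type is in leaf_types are yielded but not descended into.
    """
    yield obj
    if isinstance(obj, (str, bytes)) or isinstance(obj, leaf_types):
        return
    try:
        children = iter(obj)
    except TypeError:
        return  # not a container
    for child in children:
        yield from iter_layout(child, leaf_types)


# Stand-in data: lists play the role of containers, tuples the role of leaves
print(list(iter_layout([1, (2, 3)], leaf_types=(tuple,))))
```

With pdfminer.six, the same walk could be written as iter_layout(layout, leaf_types=(LTTextBox,)), which descends into figures but keeps each text box whole.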


Although pdfminer is harder to use, complicated cases eventually call for it. Among current open source modules, it probably has the most comprehensive PDF support.

pdfplumber, covered next, is built on pdfminer.six and lowers the barrier to use.

pdfplumber

Compared with pdfminer.six, pdfplumber provides a more convenient interface for extracting PDF content.

Common operations in daily work include:

  • Extract PDF content and save to txt file

  • Extract tables in PDF to Excel

  • Extract images from PDF

  • Extract charts in PDF

Extract PDF content and save to txt file

import pathlib
import pdfplumber
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path = path.joinpath('002pdf_out.txt')
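# Write mode with UTF-8 so reruns do not append duplicate text
with pdfplumber.open(f_path) as pdf, open(out_path, 'w', encoding='utf-8') as txt:
    for page in pdf.pages:
        # extract_text() may return None for a page without text
        textdata = page.extract_text() or ''
        txt.write(textdata + '\n')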
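
Pages without any text deserve a little care, because extract_text may return None for them in some pdfplumber versions. A minimal sketch of joining per-page text safely follows; the helper name join_page_texts and the blank-line separator are hypothetical choices.

```python
def join_page_texts(texts, separator="\n\n"):
    """Join per-page text, treating None (no text found) as an empty page."""
    return separator.join(t or "" for t in texts)


print(repr(join_page_texts(["page one", None, "page three"])))
```

With pdfplumber this could be called as join_page_texts(page.extract_text() for page in pdf.pages), and the result written to the txt file in one step.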


Extract tables in PDF to Excel

import pathlib
import pdfplumber
from openpyxl import Workbook
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path = path.joinpath('002pdf_excel.xlsx')
wb = Workbook()
sheet = wb.active
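with pdfplumber.open(f_path) as pdf:
    # Process pages 20 to 22 (0-based indexes 19 to 21)
    for i in range(19, 22):
        page = pdf.pages[i]
        table = page.extract_table()
        if table is None:  # no table detected on this page
            continue
        for row in table:
            sheet.append(row)
wb.save(out_path)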

The code above uses openpyxl to create the Excel file; a separate article will cover it later.
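
Extracted tables often need light cleanup before they go into a spreadsheet, since cells can be None or contain line breaks. The following is a minimal sketch, assuming the table is a list of rows of cell values as returned by extract_table; the helper name clean_table is hypothetical.

```python
def clean_table(table):
    """Normalize a table given as a list of rows of cell values."""
    if table is None:
        return []
    cleaned = []
    for row in table:
        # Replace missing cells with '' and flatten line breaks inside cells
        cells = ["" if c is None else str(c).replace("\n", " ").strip() for c in row]
        if any(cells):  # drop rows that are entirely empty
            cleaned.append(cells)
    return cleaned


print(clean_table([["Name\nCN", None], [None, None], [" a ", "1"]]))
```

Each cleaned row could then be passed to sheet.append(row) as in the loop above.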

Extract images from PDF

import pathlib
import pdfplumber
from PIL import Image
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-疫情影响下的中国社区趋势研究-艾瑞.pdf')
out_path = path.joinpath('002pdf_images.png')
with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[10]
    # Save a rendering of the whole page
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Save each embedded image on the page
    imgs = page.images
    for i, img in enumerate(imgs):
        size = img['width'], img['height']
        # Raw bytes of the embedded stream; depending on the image encoding
        # (for example JPEG), they may not form a valid PNG file
        data = img['stream'].get_data()
        img_path = path.joinpath(f'002pdf_images_{i}.png')
        with open(img_path, 'wb') as fimg_out:
            fimg_out.write(data)

The code above relies on PIL (Pillow) for image handling.

Extract charts in PDF

Charts differ from images: they are graphics generated from data, such as histograms and pie charts.

import pathlib
import pdfplumber
from PIL import Image
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path = path.joinpath('002pdf_figures.png')
with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[7]
    # Save a rendering of the whole page
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Crop each figure region and save it as its own image
    figures = page.figures
    for i, fig in enumerate(figures):
        size = fig['width'], fig['height']
        crop = page.crop((fig['x0'], fig['top'], fig['x1'], fig['bottom']))
        img_crop = crop.to_image()
        fig_path = path.joinpath(f'002pdf_figures_{i}.png')
        img_crop.save(fig_path, format='png')
    # Outline words (yellow), images (blue) and figures (default color) on the page image
    im.draw_rects(page.extract_words(), stroke='yellow')
    im.draw_rects(page.images, stroke='blue')
    im.draw_rects(page.figures)
im # show in notebook
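
A figure's bounding box can extend slightly past the page edge, and cropping may then fail; whether it does depends on the pdfplumber version, so treat this as an assumption. The following is a minimal sketch of clipping a box to the page bounds first. The helper name clamp_bbox is hypothetical, and boxes are (x0, top, x1, bottom) tuples as in the crop call above.

```python
def clamp_bbox(bbox, page_bbox):
    """Clip (x0, top, x1, bottom) to the page bounds; return None if nothing is left."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clipped = (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))
    if clipped[0] >= clipped[2] or clipped[1] >= clipped[3]:
        return None  # the box lies entirely outside the page
    return clipped


# Made-up numbers: a box overflowing an A4-sized page on the left and bottom
print(clamp_bbox((-5, 10, 100, 900), (0, 0, 595, 842)))  # (0, 10, 100, 842)
```

With pdfplumber, the page bounds are available as page.bbox, so the crop could be written as page.crop(clamp_bbox(box, page.bbox)) after checking that the result is not None.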

Origin blog.csdn.net/stay_foolish12/article/details/112847712