Remove watermark in scanned PDF with Python

Content overview

We have a scanned PDF file with a watermark; one of its pages is shown in the figure below. The goal is to use Python to remove the watermark at the top and bottom of each page.
[Figure: a sample page of the watermarked PDF]
Processing idea: The watermark sits at basically the same relative position on every page of the PDF. So we export each page of the PDF as an image, draw a white-filled rectangle over the watermark in each image, and finally reassemble the processed images into a PDF file.

Disadvantages of this method:

  1. The size of the PDF file obtained after processing is much larger than the original file.
  2. If text could be extracted from the original PDF, it can no longer be extracted after processing with this method.
  3. It is hard to handle watermarks that overlap the text.
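
The first drawback follows from rasterizing: every page becomes a bitmap, and its pixel count grows with the square of the rendering resolution. Below is a rough sketch of that arithmetic. It assumes an A4 page (8.27 x 11.69 inches), which is close to the 827 x 1170 pixels measured at 100 dpi later in this article. The helper name `raster_size` is made up for illustration, and the actual renderer may round slightly differently.

```python
def raster_size(width_in, height_in, dpi):
    """Approximate pixel dimensions of a page rendered at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)

# Assumed page size: A4 in inches (an assumption, not taken from the PDF)
A4 = (8.27, 11.69)

for dpi in (100, 150, 300):
    w, h = raster_size(*A4, dpi)
    # 3 bytes per pixel for uncompressed RGB; the saved JPEG files are smaller
    print(dpi, (w, h), "%.1f MB raw" % (w * h * 3 / 1e6))
```

Lowering the dpi is therefore the most direct way to keep the result small, at the cost of sharpness.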

Side effect:

  1. A PDF with printing disabled becomes printable.

Prerequisites:

  1. Install poppler and add the folder containing its executables to the Path environment variable (on Windows).
  2. pip install pdf2image
  3. pip install fpdf
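
Before running the conversion, it can help to confirm that poppler is actually reachable. pdf2image drives poppler's command-line tools such as `pdftoppm`, so a lookup on the search path catches the most common setup problem. This is a minimal sketch using only the standard library; the function name `find_poppler` is hypothetical.

```python
import shutil

def find_poppler(tool="pdftoppm"):
    """Return the full path of a poppler tool found on PATH, or None."""
    return shutil.which(tool)

path = find_poppler()
if path is None:
    print("pdftoppm not found; check that poppler's bin folder is on PATH")
else:
    print("poppler found at", path)
```

If the environment variable cannot be changed, `convert_from_path` also accepts a `poppler_path` argument that points at that folder.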

Step by step

Export the PDF to a set of image files

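import os

from pdf2image import convert_from_path
from PIL import ImageDraw

# Page width and height at 100 dpi, and the (left, top, right, bottom)
# coordinates of each watermark measured at that resolution
filePath = "a.pdf"
dpi = 100  # resolution at which the reference values below were measured
watermark1 = (290, 47, 536, 66)
watermark2 = (283, 1072, 542, 1165)
gWidth = 827
gHeight = 1170
###########

os.makedirs("out", exist_ok=True)  # the output folder must exist before saving

dpi2 = 150  # adjust this parameter as needed <===============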
pages = convert_from_path(filePath, dpi2)
width, height = pages[0].size

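# Scale the reference coordinates to the rendered size, so that changing dpi2
# does not require measuring the coordinates in an image editor every time.
# The width and height do not seem to be exactly proportional to the dpi.
watermark1 = (watermark1[0] * width / gWidth, watermark1[1] * height / gHeight,
              watermark1[2] * width / gWidth, watermark1[3] * height / gHeight)

watermark2 = (watermark2[0] * width / gWidth, watermark2[1] * height / gHeight,
              watermark2[2] * width / gWidth, watermark2[3] * height / gHeight)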

print(watermark1)
print(watermark2)

num = 0
for page in pages:

    draw = ImageDraw.Draw(page)
    
    # Paint the watermarks white; the coordinates can be read off in the system's built-in image editor
    draw.rectangle(watermark1, fill = 'white')
    draw.rectangle(watermark2, fill = 'white')
    outPath = 'out/%d.jpg' % num
    
    print(outPath)
    page.save(outPath, 'JPEG')
    num = num + 1
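
The coordinate scaling in the script above can be factored into a small helper that also rounds to whole pixels. This is a sketch rather than part of the original script; `scale_box` and its parameter names are chosen here for illustration, and the target size of 1654 x 2340 is just an example value.

```python
def scale_box(box, ref_size, actual_size):
    """Scale a (left, top, right, bottom) box measured on a page of
    ref_size = (width, height) to a page rendered at actual_size."""
    ref_w, ref_h = ref_size
    act_w, act_h = actual_size
    left, top, right, bottom = box
    return (round(left * act_w / ref_w), round(top * act_h / ref_h),
            round(right * act_w / ref_w), round(bottom * act_h / ref_h))

# Reference values from the script: measured on an 827 x 1170 render
top_mark = scale_box((290, 47, 536, 66), (827, 1170), (1654, 2340))
print(top_mark)
```

Passing the box and both sizes explicitly makes it easy to add further watermark regions without repeating the arithmetic.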

Combine processed images into PDF files

from fpdf import FPDF
from PIL import Image
import os,re

def makePdf(pdfFileName, listPages):

    # Sort by the page number in the file name, so that 10.jpg comes after 9.jpg
    listPages.sort(key = lambda i : int(re.search(r'(\d+)', os.path.basename(i)).group(1)))

    # Use the pixel size of the first page as the PDF page size (in points)
    cover = Image.open(listPages[0])
    width, height = cover.size

    pdf = FPDF(unit = "pt", format = [width, height])

    for page in listPages:
        print(page)
        pdf.add_page()
        pdf.image(page, 0, 0)  # place the image at the top-left corner

    pdf.output(pdfFileName, "F")

makePdf("result.pdf", ["out/" + imgFileName for imgFileName in os.listdir('out')
                       if imgFileName.endswith("jpg")])
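
The sort key in `makePdf` matters because file names sort as text by default, which would put `10.jpg` before `2.jpg`. The following sketch isolates that step. `page_number` is a hypothetical helper, and it assumes each file is named `<number>.jpg`, as produced by the export step.

```python
import os
import re

def page_number(path):
    """Extract the page number from a path such as 'out/12.jpg'."""
    name = os.path.basename(path)  # ignore any digits in folder names
    match = re.match(r'(\d+)\.jpg$', name)
    if match is None:
        raise ValueError("unexpected file name: %r" % path)
    return int(match.group(1))

files = ["out/10.jpg", "out/2.jpg", "out/0.jpg", "out/1.jpg"]
print(sorted(files))                   # text order
print(sorted(files, key=page_number))  # page order
```

Raising an error on unexpected names is a design choice here: a stray file in the folder is reported instead of silently landing in the PDF.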


Origin blog.csdn.net/u011863024/article/details/123321179