Working with PDFs in Python (reading a PDF with pdfplumber and writing it to Excel)

1. Working with PDFs in Python (reading a PDF with pdfplumber and writing it to Excel)

1.1 Installing the pdfplumber library

Install pdfplumber: pip install pdfplumber

The pdfplumber.PDF class

The pdfplumber.PDF class represents a single PDF and has two main attributes:

Attribute       Description
pdf.metadata    A dictionary of metadata key/value pairs taken from the PDF's Info dictionary. It usually includes "CreationDate", "ModDate", "Producer", etc.
pdf.pages       A list of pdfplumber.Page instances, one for each page of the PDF.
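
The date entries in pdf.metadata (for example "CreationDate") are plain strings in the PDF date format, such as D:20220812102327+02'23'. The following is a minimal sketch of converting such a string to a datetime using only the standard library. parse_pdf_date is a hypothetical helper name, not part of pdfplumber, and the pattern assumes the common D:YYYYMMDDHHmmSS form with an optional UTC offset.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional
# offset such as +02'23' or Z. Other variants are not handled here.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it does not match."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    naive = datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
    )
    if sign is None:
        return naive  # the string carries no timezone information
    offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
    if sign == "-":
        offset = -offset
    return naive.replace(tzinfo=timezone(offset))
```

With a PDF opened through pdfplumber, this could be called as parse_pdf_date(pdf.metadata["CreationDate"]), assuming the key is present in that file's metadata.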

The pdfplumber.Page class

Common attributes of the pdfplumber.Page class

Attribute       Description
.page_number    Sequential page number, starting at 1 for the first page, 2 for the second page, and so on.
.width          Page width.
.height         Page height.
.objects / .chars / .lines / .rects / .curves / .figures / .images    Each of these attributes is a list containing one dictionary for each object of that type embedded on the page. See the "Objects" section of the pdfplumber documentation for details.

Common methods

Method              Description
.extract_text()     Extracts the text of the page by collating all of the page's character objects into a single string.
.extract_words()    Returns all words on the page together with their related information.
.extract_tables()   Extracts the tables on the page.
.to_image()         Used for visual debugging; returns an instance of the PageImage class.
.close()            By default, a Page object caches its layout and object information to avoid reprocessing it, but when parsing large PDFs these cached properties can require a lot of memory. Use this method to flush the cache and free the memory.
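
As a sketch of how .extract_text() and .close() might be combined when a PDF has many pages, the helper below extracts the text page by page and releases each page's cache as soon as it is done. The function names extract_text_by_page and read_pdf_text are hypothetical. The code assumes an installed pdfplumber version in which Page.close() is available; some older releases named this method flush_cache(), so check the installed version.

```python
def extract_text_by_page(pages):
    """Return a list of (page_number, text) pairs, flushing each page's cache after use."""
    results = []
    for page in pages:
        # Fall back to an empty string if a page yields no text
        text = page.extract_text() or ""
        results.append((page.page_number, text))
        # Release the cached layout and object data for this page
        page.close()
    return results


def read_pdf_text(path):
    """Open the PDF at `path` and extract its text one page at a time."""
    # Imported here so extract_text_by_page can be used without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return extract_text_by_page(pdf.pages)
```

Calling read_pdf_text("some.pdf") (the filename is a placeholder) would return one tuple per page, which keeps memory use roughly bounded by the largest single page rather than by the whole document.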

1.2 Common operations

PDF is short for Portable Document Format, and files of this type usually use .pdf as their extension. In day-to-day development work, the two tasks you are most likely to run into are reading text content from a PDF and generating a PDF document from existing content.

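1. Read the PDF document information
2. Print the total number of pages
3. Read the width, height, and other information of the first page
4. Read the text of the first page

Loading a PDF
  pdfplumber.open("path/filename.pdf", password="test", laparams={"line_overlap": 0.7})
     password: to load a password-protected PDF, pass the password keyword argument
     laparams: to set layout analysis parameters for pdfminer.six's layout engine, pass the laparams keyword argument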

1.2.1 Example: reading a PDF file in Python

The PDF file looks like this:

1.2.2 Python code for reading the PDF file

import pdfplumber

# Load the PDF
path = "C:/Users/Administrator/Desktop/test08/test11 - 多页.pdf"
with pdfplumber.open(path) as pdf:
    print(pdf)
    print(type(pdf))

    # Read the PDF document information
    print("PDF document info:", pdf.metadata)

    # Print the total number of pages
    print("Total pages:", len(pdf.pages))

    # 1. Read the width, height, and other information of the first page
    first_page = pdf.pages[0]  # pdfplumber.Page object for the first page
    # Page number
    print("Page number:", first_page.page_number)
    # Page width
    print("Page width:", first_page.width)
    # Page height
    print("Page height:", first_page.height)

    # 2. Read the text of the first page
    text = first_page.extract_text()
    print(text)

Output:
"D:\Program Files1\Python\python.exe" D:/Pycharm-work/pythonTest/打卡/0811读取pdf.py
<pdfplumber.pdf.PDF object at 0x0000000002846278>
<class 'pdfplumber.pdf.PDF'>
PDF document info: {'Author': '', 'Comments': '', 'Company': '', 'CreationDate': "D:20220812102327+02'23'", 'Creator': 'WPS 表格', 'Keywords': '', 'ModDate': "D:20220812102327+02'23'", 'Producer': '', 'SourceModified': "D:20220812102327+02'23'", 'Subject': '', 'Title': '', 'Trapped': 'False'}
Total pages: 2
Page number: 1
Page width: 595.25
Page height: 841.85
姓名 年龄 性别 地址 学习技能
张三 20 女 北京 python
李四 25 男 深圳 java
赵五 28 男 上海 C++
孙六 23 女 广州 python
钱七 27 男 珠海 python
张101 20 女 北京 python
.......
.......
张150 27 男 珠海 python
张151 20 女 北京 python
张152 25 男 深圳 java

Process finished with exit code 0

1.2.3 Python code for reading the PDF file and saving it to Excel

import pdfplumber
import xlwt

# Load the PDF
path = "C:/Users/Administrator/Desktop/test08/test11 - 多页.pdf"
with pdfplumber.open(path) as pdf:
    page_1 = pdf.pages[0]  # first page of the PDF
    table_1 = page_1.extract_table()  # read the table data as a list of rows
    print(table_1)
    # 1. Create the Excel workbook
    workbook = xlwt.Workbook(encoding='utf8')
    # 2. Add a new sheet
    worksheet = workbook.add_sheet('Sheet1')
    # 3. Use the first row of the table as the column names
    header = table_1[0]
    # 4. Write the column names into the first row of the sheet
    for col, name in enumerate(header):
        worksheet.write(0, col, name)
    # 5. Write the data rows into the sheet, starting at the second row
    for row, data in enumerate(table_1[1:], start=1):
        for col, value in enumerate(data):
            worksheet.write(row, col, value)
    # Save the Excel file (xlwt writes the legacy .xls format)
    workbook.save('test88.xls')

Result: the table from the first page is saved to test88.xls.
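
The code above only writes the table from the first page, while the sample PDF has two pages. One possible way to cover every page is sketched below: collect the table from each page, keep the header row once, and write the combined rows to a single sheet. The names merge_page_tables and pdf_tables_to_xls are hypothetical, and the sketch assumes each page holds at most one table with the same columns, optionally repeating the header row.

```python
def merge_page_tables(tables):
    """Combine per-page tables into (header, rows), dropping repeated header rows."""
    header = None
    rows = []
    for table in tables:
        if not table:  # skip pages where no table was found (None or empty)
            continue
        if header is None:
            header = table[0]
            body = table[1:]
        else:
            # Drop the first row only if it repeats the header
            body = table[1:] if table[0] == header else table
        rows.extend(body)
    return header, rows


def pdf_tables_to_xls(pdf_path, xls_path):
    """Write the merged tables of all pages of a PDF to one .xls sheet."""
    # Imported here so merge_page_tables can be used without these packages
    import pdfplumber
    import xlwt

    with pdfplumber.open(pdf_path) as pdf:
        header, rows = merge_page_tables(page.extract_table() for page in pdf.pages)
    if header is None:
        raise ValueError("no table found in the PDF")

    workbook = xlwt.Workbook(encoding="utf8")
    worksheet = workbook.add_sheet("Sheet1")
    for r, row in enumerate([header] + rows):
        for c, cell in enumerate(row):
            worksheet.write(r, c, cell)
    workbook.save(xls_path)
```

For example, pdf_tables_to_xls(path, "test88.xls") would reuse the path variable from the code above. Keeping the merge step separate from the file handling makes it easy to check the row logic on plain lists before running it against a real PDF.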

Origin blog.csdn.net/weixin_73136678/article/details/128793909