1. Working with PDFs in Python (reading a PDF with pdfplumber and writing it to Excel)
1.1 Install the pdfplumber library
Install pdfplumber: pip install pdfplumber
pdfplumber.PDF class
The pdfplumber.PDF class represents a single PDF and has two main attributes:
Attribute | Description |
---|---|
pdf.metadata | A dictionary of metadata key/value pairs taken from the PDF's Info. It usually includes "CreationDate", "ModDate", "Producer", etc. |
pdf.pages | A list of pdfplumber.Page instances, one for each page of the PDF |
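The date entries in pdf.metadata are usually plain strings in the PDF date format (D:YYYYMMDDHHmmSS, optionally followed by a time-zone offset) rather than datetime objects. As a minimal sketch, the helper parse_pdf_date below converts such a string into a naive datetime. The helper name is hypothetical and not part of pdfplumber, and the sketch deliberately ignores the time-zone offset.

```python
import re
from datetime import datetime


def parse_pdf_date(value):
    """Parse a PDF date string such as "D:20220812102327+02'23'".

    Hypothetical helper, not part of pdfplumber. Only the leading
    YYYYMMDDHHmmSS digits are used; the time-zone offset is ignored.
    Returns None if the string does not start with 14 date digits.
    """
    match = re.match(r"(?:D:)?(\d{14})", value or "")
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


# Example with a CreationDate-style value
print(parse_pdf_date("D:20220812102327+02'23'"))
```

In practice the input would come from something like pdf.metadata.get("CreationDate"), which may be missing, so the helper accepts None as well.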
pdfplumber.Page class
Common attributes of the pdfplumber.Page class
Attribute | Description |
---|---|
.page_number | The sequential page number, starting with 1 for the first page, 2 for the second page, and so on |
.width | The page width |
.height | The page height |
.objects / .chars / .lines / .rects / .curves / .figures / .images | Each of these attributes is a list, and each list contains one dictionary for every such object embedded on the page; see "Objects" below for details. |
Common methods
Method | Description |
---|---|
.extract_text() | Extracts the text of the page by collating all of the page's character objects into a single string |
.extract_words() | Returns all words on the page together with their related information |
.extract_tables() | Extracts the tables on the page |
.to_image() | Used for visual debugging; returns an instance of the PageImage class |
.close() | By default, a Page object caches its layout and object information to avoid reprocessing it, but when parsing large PDFs these cached properties can require a lot of memory. Use this method to flush the cache and free that memory. |
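To illustrate how these methods fit together, here is a minimal sketch that walks through every page, extracts its text, and releases the page cache before moving on. The function names iter_page_text and print_all_text are hypothetical, and the pdfplumber import is kept inside the function that needs it so the generator itself works with any object exposing a .pages list.

```python
def iter_page_text(pdf):
    """Yield (page_number, text) for every page, releasing each page's cache.

    `pdf` is expected to be a pdfplumber.PDF; any object with a `.pages`
    list of page-like objects works. Hypothetical helper for illustration.
    """
    for page in pdf.pages:
        # Fall back to an empty string if a page yields no text
        text = page.extract_text() or ""
        number = page.page_number
        page.close()  # flush the cached layout before moving on
        yield number, text


def print_all_text(path):
    """Print the text of every page of the PDF at `path`."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        for number, text in iter_page_text(pdf):
            print(f"--- page {number} ---")
            print(text)
```

Calling print_all_text with the path to a PDF prints each page in turn. Closing each page as soon as its text has been read keeps memory use roughly constant for long documents.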
1.2 Common operations
PDF stands for Portable Document Format, and files of this type usually use .pdf as their extension. In day-to-day development work, the two tasks you are most likely to encounter are reading text content from a PDF and generating a PDF document from existing content.
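1. Read the PDF document information
2. Output the total number of pages
3. Read the width, height, and other information of the first page
4. Read the text of the first page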
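Loading a PDF
pdfplumber.open("path/filename.pdf", password="test", laparams={"line_overlap": 0.7})
password: to load a password-protected PDF, pass the password keyword argument
laparams: to set layout analysis parameters for pdfminer.six's layout engine, pass the laparams keyword argument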
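Both keyword arguments are optional, so a small wrapper can pass only the ones that were actually supplied. The sketch below assumes the two helper names build_open_kwargs and open_pdf, which are hypothetical and not part of pdfplumber; line_overlap is one of the layout parameters that pdfminer.six accepts.

```python
def build_open_kwargs(password=None, line_overlap=None):
    """Build the optional keyword arguments for pdfplumber.open().

    Hypothetical helper: only options that were given are included, so
    pdfplumber's defaults apply to everything else.
    """
    kwargs = {}
    if password is not None:
        kwargs["password"] = password
    if line_overlap is not None:
        # laparams is a dict of pdfminer.six layout-analysis parameters
        kwargs["laparams"] = {"line_overlap": line_overlap}
    return kwargs


def open_pdf(path, password=None, line_overlap=None):
    """Open `path` with pdfplumber using the options above."""
    import pdfplumber  # third-party: pip install pdfplumber

    return pdfplumber.open(path, **build_open_kwargs(password, line_overlap))
```

For example, open_pdf("path/filename.pdf", password="test", line_overlap=0.7) is equivalent to the call shown above, and it can be used in a with statement in the same way.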
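1.2.1 Example: reading a PDF file with Python
The PDF file used in this example is a two-page document containing a table with the columns name, age, gender, address, and skill.
1.2.2 Python code to read the PDF file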
import pdfplumber

# Load the PDF
path = "C:/Users/Administrator/Desktop/test08/test11 - 多页.pdf"
with pdfplumber.open(path) as pdf:
    print(pdf)
    print(type(pdf))
    # Read the PDF document information (metadata)
    print("PDF document info:", pdf.metadata)
    # Output the total number of pages
    print("Total pages:", len(pdf.pages))
    # 1. Read the width, height, and other information of the first page
    first_page = pdf.pages[0]  # pdfplumber.Page object for the first page
    # Page number
    print("Page number:", first_page.page_number)
    # Page width
    print("Page width:", first_page.width)
    # Page height
    print("Page height:", first_page.height)
    # 2. Read the text of the first page
    text = first_page.extract_text()
    print(text)
Output:
"D:\Program Files1\Python\python.exe" D:/Pycharm-work/pythonTest/打卡/0811读取pdf.py
<pdfplumber.pdf.PDF object at 0x0000000002846278>
<class 'pdfplumber.pdf.PDF'>
PDF document info: {'Author': '', 'Comments': '', 'Company': '', 'CreationDate': "D:20220812102327+02'23'", 'Creator': 'WPS 表格', 'Keywords': '', 'ModDate': "D:20220812102327+02'23'", 'Producer': '', 'SourceModified': "D:20220812102327+02'23'", 'Subject': '', 'Title': '', 'Trapped': 'False'}
Total pages: 2
Page number: 1
Page width: 595.25
Page height: 841.85
姓名 年龄 性别 地址 学习技能
张三 20 女 北京 python
李四 25 男 深圳 java
赵五 28 男 上海 C++
孙六 23 女 广州 python
钱七 27 男 珠海 python
张101 20 女 北京 python
.......
.......
张150 27 男 珠海 python
张151 20 女 北京 python
张152 25 男 深圳 java
Process finished with exit code 0
1.2.3 Python code to read the PDF file and save it to Excel
import pdfplumber
import xlwt

# Load the PDF
path = "C:/Users/Administrator/Desktop/test08/test11 - 多页.pdf"
with pdfplumber.open(path) as pdf:
    page_1 = pdf.pages[0]  # first page of the PDF
    table_1 = page_1.extract_table()  # read the table data as a list of rows
    print(table_1)

# 1. Create the Excel workbook object
workbook = xlwt.Workbook(encoding='utf8')
# 2. Add a new sheet
worksheet = workbook.add_sheet('Sheet1')
# 3. Use the first row of the table as the column names
col1 = table_1[0]
# 4. Write the column names into the first row of the sheet
for i in range(len(col1)):
    worksheet.write(0, i, col1[i])
# 5. Write the data rows into the sheet, starting at the second row
for i, data in enumerate(table_1[1:]):
    for j in range(len(col1)):
        worksheet.write(i + 1, j, data[j])
# Save the Excel file (xlwt writes the legacy .xls format)
workbook.save('test88.xls')
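The code above only exports the table on the first page, while the sample PDF has more than one page. The following is a minimal sketch of one way to extend it to all pages. It assumes that each page holds a single table and that a page may repeat the header row; the function names collect_table_rows, save_rows_to_xls, and pdf_to_xls are hypothetical, as is the output filename passed in by the caller.

```python
def collect_table_rows(pages):
    """Merge the table that extract_table() returns for each page.

    Assumptions for this sketch: one table per page, and a page may
    repeat the header row, in which case the repeat is skipped.
    """
    rows = []
    header = None
    for page in pages:
        table = page.extract_table()  # None when no table is found
        if not table:
            continue
        if header is None:
            # The first row of the first table becomes the header
            header = table[0]
            rows.append(header)
            table = table[1:]
        elif table[0] == header:
            # Drop a header row repeated on a later page
            table = table[1:]
        rows.extend(table)
    return rows


def save_rows_to_xls(rows, filename):
    """Write a list of rows to the first sheet of a new .xls file."""
    import xlwt  # third-party: pip install xlwt

    workbook = xlwt.Workbook(encoding="utf8")
    worksheet = workbook.add_sheet("Sheet1")
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            worksheet.write(r, c, value)
    workbook.save(filename)


def pdf_to_xls(path, filename):
    """Export the tables of every page of `path` into `filename`."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        rows = collect_table_rows(pdf.pages)
    save_rows_to_xls(rows, filename)
```

Keeping the row collection separate from the Excel writing makes it easy to check the merged rows before saving them, or to swap in a different writer later.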
Result: