【PYTHON, PDF】1. Extracting text and tables from PDFs with Python (PyPDF2, pdfplumber)

0. Install the modules

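Windows: pip install pypdf2
         pip install pdfplumber
         pip install openpyxl
macOS:   pip3 install pypdf2
         pip3 install pdfplumber
         pip3 install openpyxl

(openpyxl is only needed for writing Excel files in section 3.)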

If the installation fails, try installing from the Tsinghua mirror:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfplumber
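Before moving on, it can help to confirm that the packages are visible to the interpreter you will actually run. The snippet below is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the default list assumes the import names `pdfplumber`, `PyPDF2` and `openpyxl`.

```python
import importlib.util


def missing_modules(names=("pdfplumber", "PyPDF2", "openpyxl")):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every module in the default list can be imported.
print(missing_modules())
```

If the list is not empty, the usual cause is that pip installed into a different Python than the one running the script.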

1. Extracting text from a PDF

Extract a single page:

import pdfplumber

with pdfplumber.open("1.pdf") as pdf:
	# pdf.pages is a list of page objects; index 0 is the first page
	page1 = pdf.pages[0]
	print(page1.extract_text())

Output:
附件 1 2020 年第 33 届国际青年物理学家锦标赛(IYPT2020)赛题
1. Invent Yourself  the relevant parameters.
Design an instrument for measuring  3.  摇摆的声管
current using its heating effect. What are  声管是一种玩具,由波纹塑料管
the accuracy, precision and limits of the  组成,你可以旋转声管产生声音。研究
method?  这些玩具发出的声音的特性,以及它
1.  你来发明  们如何受到相关参数的影响。
设计一种利用热效应测量电流的
仪器。该方法的准确度、精密度和局限 4. Singing Ferrite
性是什么?  Insert a ferrite rod into a coil fed
  from  a  signal  generator.  At  some
2. Inconspicuous Bottle  frequencies the rod begins to produce a
Put a lit candle behind a bottle. If  sound. Investigate the phenomenon.
you blow on the bottle from the opposite  4. “歌神”铁氧体
side, the candle may go out, as if the  将铁氧体棒插入信号发生器供电
bottle was not there at all. Explain the  的线圈中。在某些频率下,铁氧体棒开
phenomenon.  始发出声音。研究这一现象。
2.  不起眼的瓶子
将点燃的蜡烛放在瓶子后面。如 5. Sweet Mirage
果你从蜡烛的对面吹瓶子,蜡烛同样 Fata Morgana is the name given to a
可能熄灭,好像瓶子根本不在那里。解 particular  form  of  mirage.  A  similar
释这个现象。  effect can be produced by shining a laser
  through a fluid with a refractive index
3. Swing Sound Tube  gradient. Investigate the phenomenon.
A Sound Tube is a toy, consisting of  5.  甜蜜的海市蜃楼
a corrugated plastic tube, that you can  法塔莫干纳是一种特殊形式的海
spin around to produce sounds. Study the  市蜃楼的名字。而使用激光照射具有
characteristics of the sounds produced by  折射率梯度的流体时,也会产生类似
such toys, and how they are affected by  的效果。研究这一现象。
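The output above mixes lines from different problems because the sample page appears to be laid out in two columns, and extract_text() reads straight across both. One possible workaround is to crop the page into a left and a right half and extract each separately. This is only a sketch: `extract_columns` is a hypothetical helper, and it assumes the columns split at the middle of the page, which may not hold for other documents.

```python
def extract_columns(page, split_ratio=0.5):
    """Return (left_text, right_text) for a page with two side-by-side columns.

    `page` is expected to be a pdfplumber page; `split_ratio` is an assumption
    about where the gap between the columns sits (0.5 = middle of the page).
    """
    x0, top, x1, bottom = page.bbox
    split_x = x0 + (x1 - x0) * split_ratio
    left = page.crop((x0, top, split_x, bottom))
    right = page.crop((split_x, top, x1, bottom))
    # Fall back to "" in case a cropped region yields no text at all.
    return (left.extract_text() or "", right.extract_text() or "")


def print_columns(path="1.pdf"):
    """Print the left column of page 1, then the right column."""
    import pdfplumber  # imported here so extract_columns stays usable on its own

    with pdfplumber.open(path) as pdf:
        left, right = extract_columns(pdf.pages[0])
    print(left)
    print(right)
```

Calling print_columns("1.pdf") should print the whole left column before the right one, provided the midpoint assumption fits the page.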

Extract the text of every page:

import pdfplumber

with pdfplumber.open("1.pdf") as pdf:
	# Loop over all pages and print the text of each one
	for page in pdf.pages:
		print(page.extract_text())
Output (page 1 is identical to the single-page output above and is omitted here; the remaining pages follow):
  However,  a  light  particle  may  not
6. Saxon Bowl  penetrate the film and may remain on its
A bowl with a hole in its base will  surface. Investigate the properties of such
sink when placed in water. The Saxons  a membrane filter.
used  this  device  for  timing  purposes.  8.  肥皂膜过滤器
Investigate the parameters that determine  一个重颗粒可以通过一个水平的
the time of sinking.  肥皂膜而不会使其破裂。然而,轻粒子
6.  撒克逊碗  可能无法穿透膜并可能停留在其表面
一个底部有洞的碗放在水中会下 上。研究这种膜过滤器的性能。
沉。撒克逊人用这个装置来计时。研究
决定下沉时间的参数。  9. Magnetic Levitation
  Under  certain  circumstances,  the
7. Balls on a String  “flea” of a magnetic stirrer can rise up
Put a string through a ball with a  and  levitate  stably  in  a  viscous  fluid
hole in it such that the ball can move  during stirring. Investigate the origins of
freely along the string. Attach another  the dynamics tabilization of the “flea”
ball to one end of the string. When you  and  how  it  depends  on  the  relevant
move the free end periodically, you can  parameters.
observe complex movements of the two  9.  磁悬浮
balls. Investigate the phenomenon.  在某些特定情况下,磁力搅拌器
7.  绳子上的球  的“搅拌子”在搅拌时,能在粘性流体中
将绳子穿过一个带有洞的球,这 稳定地上升和悬浮。研究“搅拌子”动态
样球就可以沿着绳子自由移动。把另 稳定的起源,以及它如何依赖相关参
一个球系在绳子的一端。当你周期性 数。
地移动绳子的自由端时,你可以观察
到两个球的复杂运动。研究这一现象。  10. Conducting Lines
  A line drawn with a pencil on paper
8. Soap Membrane Filter  can  be  electrically  conducting.
A heavy particle may fall through a  Investigate  the  characteristics  of  the
horizontal soap film without rupturing it.  conducting line.
10.  画出来的导线  12. 多边形涡流
用铅笔在纸上画的线可以导电。 在瓶面附近装有旋转板的静止圆
研究这种导线特性。  柱形容器中,部分装有液体。在一定条
  件下,液体表面的形状会变成多边形。
11. Drifting Speckles  解释这一现象并研究其对相关参数的
Shine  a  laser  beam  onto  a  dark  依赖性。
surface. A granular pattern can be seen
inside  the  spot.  When  the  pattern  is  13. Friction Oscillator
observed by a camera or the eye, that is  A massive object is placed onto two
moving slowly, the pattern seems to drift  identical  parallel  horizontal  cylinders.
relative  to  the  surface.  Explain  the  The two cylinders each rotate with the
phenomenon  and  investigate  how  the  same  angular  velocity,  but  in  opposite
drift depends on relevant parameters.  directions. Investigate how the motion of
11. 漂移的斑点  the object on the cylinders depends on the
将激光束照射到黑暗的表面上。 relevant parameters.
在斑点内可以看到颗粒状图案。当用 13.  摩擦振子
相机或人眼观察这个图案时,图案似 一个大块的物体被放置在两个相
乎在缓慢移动,图案相对于表面似乎 同的平行水平圆柱体上。两个圆柱各
在漂移。解释现象并研究漂移如何取 自以相同的角速度旋转,但方向相反。
决于相关参数。  研究物体在圆柱体上的运动如何依赖
  于相关参数。
12. Polygon Vortex
A  stationary  cylindrical  vessel  14. Falling Tower
containing a rotating plate near the bottle  Identical discs are stacked one on
surface  is  partially  filled  with  liquid.  top  of  another  to  form  a  freestanding
Under certain conditions, the shape of the  tower. The bottom disc can be removed
liquid  surface  becomes  polygon-like.  by  applying  a  sudden  horizontal  force
Explain this phenomenon and investigate  such that the rest of the tower will drop
the  dependence  on  the  relevant  down  onto  the  surface  and  the  tower
parameters.  remains  standing.  Investigate  the
phenomenon  and  determine  the  distance from each other. If one of the
conditions that allow the tower to remain  pulleys is immersed into hot water, the
standing.  wire  tends  to  straighten,  causing  a
14.  下落的塔  rotation  of the  pulleys.  Investigate the
相同的圆盘,一个叠在另一个上 properties of such an engine.
面,形成一个独立的塔。当塔底部的圆 16.  镍钛合金发动机
盘通过施加一个突然的水平力来移除, 将镍钛合金线圈绕在两个滑轮上,
塔身的其余部分就会掉落到底面上, 同时两个滑轮的轴彼此相距一定距离。
并依然保持直立状态。研究该现象并 如果其中一个滑轮浸入热水中,金属
确定允许塔保持静止直立的条件。  丝就会变直,导致滑轮转动。研究这种
  发动机的性能。
15. Pepper Pot
If you take a salt or pepper pot and  17. Playing Card
just shake it, the contents will pour out  A standard playing card can travel a
relatively slowly. However, if an object is  very long distance provided that spin is
rubbed along the bottom of the pot, then  imparted as it is thrown. Investigate the
the  rate  of  pouring  can  increase  parameters that affect the distance and
dramatically.  Explain  this  phenomenon  the trajectory
and investigate how the rate depends on  17. 玩纸牌
the relevant parameters.  一张标准的扑克牌只要在投掷的
15.  胡椒罐  过程中旋转,就可以运动很长的一段
如果你拿一个盐或胡椒罐,摇晃 距离。研究影响距离和轨迹的参数
罐子,里面的东西就会慢慢地倒出来。
然而,如果一个物体沿罐底摩擦,则倒
出速度会显著增加。解释这种现象,并
研究倒出速度如何依赖于相关参数。

16. Nitinol Engine
Place a nitinol wire loop around two
pulleys with their axes located at some
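Printing is fine for a quick look, but for longer documents it is usually handier to collect the text and save it. The sketch below joins the text of all pages with a page marker and writes it to a file. `pages_to_text` and `save_pdf_text` are hypothetical helpers; the `or ""` guard assumes that a page without a text layer (for example a scanned page) may give back None or an empty string, depending on the pdfplumber version.

```python
from pathlib import Path


def pages_to_text(pages, separator="\n\n"):
    """Join the text of every page, skipping pages with no extractable text."""
    chunks = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""
        if text.strip():
            chunks.append(f"--- page {number} ---\n{text}")
    return separator.join(chunks)


def save_pdf_text(pdf_path, txt_path):
    """Extract all text from pdf_path and write it to txt_path as UTF-8."""
    import pdfplumber  # imported here so pages_to_text works without it

    with pdfplumber.open(pdf_path) as pdf:
        Path(txt_path).write_text(pages_to_text(pdf.pages), encoding="utf-8")
```

For example, save_pdf_text("1.pdf", "1.txt") would write the combined text next to the script.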

2. Extracting tables

2.1 Extract a single table:

import pdfplumber

with pdfplumber.open("2.pdf") as pdf:
	table_page = pdf.pages[0]
	# extract_table() returns one table as a list of rows
	table = table_page.extract_table()
	print(table)

Output:
[['姓名', '性别', '班级', '分数'], ['经济家', '男', '1', '20'], ['鞍达到', '男', '2', '10'], ['阿达', '男', '3', '12'], ['矮袋鼠', '男', '4', '13'], ['安三点', '女', '5', '20'], ['阿大师兄', '女', '6', '20'], ['开口处', '女', '7', '15'], ['', '', '', '']]

The result is a list of rows, where each row is itself a list of cell strings, which makes it easy to process further.
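As an illustration of how convenient the list form is, the sketch below turns the rows into dictionaries keyed by the header row and computes the average score. `table_to_records` is a hypothetical helper, and the sample data is a shortened copy of the output above.

```python
def table_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header."""
    header, *rows = table
    # any(row) is False for rows made up only of '' or None, so they are skipped.
    return [dict(zip(header, row)) for row in rows if any(row)]


table = [
    ['姓名', '性别', '班级', '分数'],
    ['经济家', '男', '1', '20'],
    ['鞍达到', '男', '2', '10'],
    ['', '', '', ''],
]

records = table_to_records(table)
scores = [int(record['分数']) for record in records]
print(sum(scores) / len(scores))  # prints 15.0
```

Note that every cell arrives as a string, so numeric columns need an explicit conversion such as int().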

2.2 Extract multiple tables:

# table_page is a page object (pdf.pages[0]); run this inside the with-block
for table in table_page.extract_tables():
	print(table)
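When tables are spread over several pages, it is useful to remember where each one came from. A minimal sketch, assuming each page exposes extract_tables() as above; `collect_tables` is a hypothetical helper name.

```python
def collect_tables(pages):
    """Return (page_number, table_number, table) for every table on every page."""
    found = []
    for page_number, page in enumerate(pages, start=1):
        for table_number, table in enumerate(page.extract_tables(), start=1):
            found.append((page_number, table_number, table))
    return found


def print_all_tables(path="2.pdf"):
    """Print every table in the PDF together with its page and position."""
    import pdfplumber  # imported here so collect_tables works without it

    with pdfplumber.open(path) as pdf:
        for page_number, table_number, table in collect_tables(pdf.pages):
            print(f"page {page_number}, table {table_number}: {len(table)} rows")
```

Both numbers are 1-based so that they match what a reader sees in a PDF viewer.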

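2.3 Table settings:

import pdfplumber

with pdfplumber.open("2.pdf") as pdf:
	table_page = pdf.pages[0]
	# table_settings controls how column and row boundaries are detected
	table = table_page.extract_table(
		table_settings={
			"vertical_strategy": "text",
			"horizontal_strategy": "lines",
		})
	print(table)

Output: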
[['姓名', '性别', '班级', '分数'], ['', '', '', ''], ['经济家', '男', '1', '20'], ['', '', '', ''], ['鞍达到', '男', '2', '10'], ['', '', '', ''], ['阿达', '男', '3', '12'], ['', '', '', ''], ['矮袋鼠', '男', '4', '13'], ['', '', '', ''], ['安三点', '女', '5', '20'], ['', '', '', ''], ['阿大师兄', '女', '6', '20'], ['', '', '', ''], ['开口处', '女', '7', '15']]
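A missing comma or a misspelled strategy name in the settings dictionary is an easy mistake to make. The sketch below builds the dictionary through a small function that checks the names first. `make_table_settings` is a hypothetical helper, not part of pdfplumber; the four strategy names are the ones listed in the pdfplumber documentation, and "explicit" additionally needs explicit line positions passed through `extra`.

```python
VALID_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}


def make_table_settings(vertical="lines", horizontal="lines", **extra):
    """Build a table_settings dict, rejecting unknown strategy names early."""
    for name, value in (("vertical", vertical), ("horizontal", horizontal)):
        if value not in VALID_STRATEGIES:
            raise ValueError(f"unknown {name} strategy: {value!r}")
    settings = {"vertical_strategy": vertical, "horizontal_strategy": horizontal}
    settings.update(extra)  # any other settings are passed through unchanged
    return settings


settings = make_table_settings(vertical="text", horizontal="lines")
print(settings)
```

The resulting dictionary can then be passed as table_page.extract_table(table_settings=settings).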

3. Writing to Excel

import pdfplumber
from openpyxl import Workbook
with pdfplumber.open("2.pdf") as pdf:
	table_page1 = pdf.pages[0]
	table = table_page1.extract_table(
		table_settings={
			"vertical_strategy":"text",
			"horizontal_strategy":"text",
		})
	print(table)

workbook = Workbook()
sheet = workbook.active
for row in table:
	sheet.append(row)
workbook.save(filename="4.xlsx")
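If openpyxl is not available, a CSV file is a simple alternative that Excel can also open. This is a sketch using only the standard library; `table_to_csv` is a hypothetical helper, and the sample rows are a shortened copy of the table from section 2.

```python
import csv
import io


def table_to_csv(table, stream):
    """Write a list of rows to an open text stream in CSV format."""
    writer = csv.writer(stream)
    writer.writerows(table)  # the csv module writes None cells as empty strings


# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
table_to_csv([['姓名', '分数'], ['经济家', '20']], buffer)
print(buffer.getvalue())
```

To write a real file, open it with open("4.csv", "w", newline="", encoding="utf-8-sig"); the utf-8-sig encoding is a common choice so that Excel recognizes the Chinese text correctly.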

4. Cleaning up the format

Cleanup can take many forms. One of the most common is removing blanks, here the empty rows:

import pdfplumber
from openpyxl import Workbook

with pdfplumber.open("2.pdf") as pdf:
	table_page1 = pdf.pages[0]
	table = table_page1.extract_table()
	print(table)

workbook = Workbook()
sheet = workbook.active
for row in table:
	# Skip rows whose cells are all empty (a cell may be '' or None)
	if any(cell not in (None, '') for cell in row):
		sheet.append(row)
# Save once, after all rows have been appended
workbook.save(filename="4.xlsx")
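The same idea can be packaged as a reusable function that also trims stray whitespace inside each cell. A minimal sketch, assuming cells are strings or None; `clean_table` is a hypothetical helper name.

```python
def clean_table(table):
    """Strip whitespace from every cell and drop rows that end up empty."""
    cleaned = []
    for row in table:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        if any(cells):  # keep the row only if at least one cell has content
            cleaned.append(cells)
    return cleaned


raw = [[' 姓名 ', '分数'], ['', None], ['阿达', '12 '], ['', '']]
print(clean_table(raw))
```

The cleaned list can be written to Excel exactly as before, with sheet.append(row) for each row.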

More posts on Excel, PPT, web scraping, AI and related topics will follow. Stay tuned.

Reposted from blog.csdn.net/AI_LINNGLONG/article/details/104312964