Converting PDF to TXT with Python 3

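I've spent the last couple of days converting PDF documents to TXT, and it has been a real headache. PDFs are hard enough to work with already, never mind turning them into plain text. I dug around the web and GitHub. The good news is that there is plenty of material on converting PDFs to text with Python; the bad news is that most of it targets Python 2.x. Sigh...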

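Maybe nobody has looked at this in a long time, because most of what's out there is stuck on 2.x. But Python 2 is on its way out and Python 3.x is what everyone uses now, so I pulled the Python 2 code together and ported it to Python 3. I hope it helps anyone who needs this.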

Here is a quick introduction to pdfminer, the main Python package used for parsing PDFs.

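Parsing a PDF is expensive in both time and memory, so PDFMiner uses a strategy called lazy parsing: it parses content only when it is actually needed, which cuts down on both. Parsing a PDF takes at least two classes, PDFParser and PDFDocument. PDFParser fetches data from the file and PDFDocument stores it. You also need PDFPageInterpreter to process the page contents and a PDFDevice to translate them into whatever output you need. PDFResourceManager stores shared resources such as fonts or images.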

LTPage
Represents an entire page. May contain child objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.

LTTextBox
Represents a group of text chunks that can be contained in a rectangular area. Note that this box is created by geometric analysis and does not necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects. The get_text() method returns the text content.

LTTextLine
Contains a list of LTChar objects that represent a single text line. The characters are aligned either horizontally or vertically, depending on the text's writing mode. The get_text() method returns the text content.

LTChar / LTAnno
Represent an actual letter in the text as a Unicode string. Note that, while an LTChar object has actual boundaries, LTAnno objects do not, as these are "virtual" characters, inserted by the layout analyzer according to the relationship between two characters (e.g. a space).

LTFigure
Represents an area used by PDF Form objects. PDF Forms can be used to present figures or pictures by embedding yet another PDF document within a page. Note that LTFigure objects can appear recursively.

LTImage
Represents an image object. Embedded images can be in JPEG or other formats, but currently PDFMiner does not pay much attention to graphical objects.

LTLine
Represents a single straight line. Could be used for separating text or figures.

LTRect
Represents a rectangle. Could be used for framing other pictures or figures.

LTCurve
Represents a generic Bezier curve.

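Now that we've gone through the individual classes, here is the PDF-to-TXT program I put together. It's up on my GitHub, so go take a look; I think you'll find it useful.
PDF-to-TXT program: https://github.com/xunfeiniao/Python-Pdfminer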
