June 8: Common methods for handling PDF and Word documents in Python

PyPDF2

PyPDF2 is a Python module for working with PDF documents; it must be imported before use.

The sequence of operations for opening a PDF document is:

Open the file in binary mode with open() and store the returned file object in a variable, then pass that variable to PdfFileReader to create a PdfFileReader object. The methods and properties of this object are then used to work with the PDF document.

PdfFileReader object methods:
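import PyPDF2
pdfFileObj = open('meetingminutes.pdf', 'rb')  # open the file in binary read mode
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages  # number of pages, e.g. 19
pageObj = pdfReader.getPage(0)  # pages are numbered from 0
pageObj.extractText()  # returns the text of the page as a string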
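
To read more than the first page, the same calls can be placed in a loop. The following is a minimal sketch, assuming the legacy PyPDF2 interface used above (numPages, getPage, extractText); the helper name extract_all_text is hypothetical and not part of PyPDF2. From PyPDF2 3.0 onward these names are replaced by PdfReader, len(reader.pages) and page.extract_text().

```python
def extract_all_text(reader):
    """Return a list with the extracted text of every page.

    `reader` is assumed to expose numPages and getPage(i).extractText(),
    like the pdfReader object created above (legacy PyPDF2 interface).
    """
    pages = []
    for i in range(reader.numPages):
        pages.append(reader.getPage(i).extractText())
    return pages


# Usage with a real file (requires PyPDF2 < 3.0 and an existing PDF):
# import PyPDF2
# with open('meetingminutes.pdf', 'rb') as pdfFileObj:
#     texts = extract_all_text(PyPDF2.PdfFileReader(pdfFileObj))
```

Opening the file in a with block, as in the commented usage, also closes it once the text has been read.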

Reference - http://copyfuture.com/blogs-details/2f4702509dd5431f7cee8208e768086a

Notes on using python-docx

Because Chinese text has to be processed, Python 3 is used here (it has fewer encoding problems than Python 2).

Install docx with: pip3 install python-docx
If the installation fails, try: easy_install python-docx

The structure of a docx document has three layers:

  • A Document object represents the entire document.
  • A Document contains a list of Paragraph objects; each Paragraph represents one paragraph.
  • A Paragraph contains a list of Run objects. A Run holds not only the text string but also its size, color, font and other attributes, all of which belong to its style. A Run is a stretch of text with a single style; a new Run begins wherever the style changes.
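
The three layers can be walked with two nested loops. This is a minimal sketch; the helper name iter_runs is hypothetical, and it only assumes that the document exposes paragraphs, each paragraph exposes runs, and each run has text and bold attributes, as python-docx objects do.

```python
def iter_runs(doc):
    """Yield (paragraph_index, run_text, is_bold) for every Run in doc."""
    for index, paragraph in enumerate(doc.paragraphs):
        for run in paragraph.runs:
            # run.bold may be True, False or None (inherited); treat None as not bold
            yield index, run.text, bool(run.bold)


# Usage with python-docx (assumes filename.docx exists):
# import docx
# for index, text, is_bold in iter_runs(docx.Document('filename.docx')):
#     print(index, repr(text), is_bold)
```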

Basic Operations

The basic operations are opening a document, writing content into it and saving it. A simple example follows.

from docx import Document
doc=Document() # with no file name a new blank document is created; with a file name (which must be an existing .docx file) that document is opened for editing
doc.add_heading('Hello') # add a heading
doc.add_paragraph('word') # add a paragraph
doc.save('test.docx') # save; the file name argument is required

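The object collections in python-docx include the following:

doc.paragraphs    # collection of paragraphs
doc.tables        # collection of tables
doc.sections      # collection of sections
doc.styles        # collection of styles
doc.inline_shapes # inline shapes, and so on

The following example combines several of these operations: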
from docx import Document
from docx.shared import Inches

document = Document()

document.add_heading('Document Title', 0)

p = document.add_paragraph('A plain paragraph having some ')
p.add_run('bold').bold = True
p.add_run(' and some ')
p.add_run('italic.').italic = True

document.add_heading('Heading, level 1', level=1)
# Styles are looked up by their display name (with spaces)
document.add_paragraph('Intense quote', style='Intense Quote')

document.add_paragraph(
    'first item in unordered list', style='List Bullet'
)
document.add_paragraph(
    'first item in ordered list', style='List Number'
)

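# monty-truth.png must exist in the working directory
document.add_picture('monty-truth.png', width=Inches(1.25))

# Sample (qty, id, desc) rows for illustration only; replace with real data
recordset = [(3, '101', 'Spam'), (7, '422', 'Eggs')]

table = document.add_table(rows=1, cols=3)
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Qty'
hdr_cells[1].text = 'Id'
hdr_cells[2].text = 'Desc'
for qty, item_id, desc in recordset:
    row_cells = table.add_row().cells
    row_cells[0].text = str(qty)
    row_cells[1].text = str(item_id)
    row_cells[2].text = desc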

document.add_page_break()

document.save('demo.docx')

Reading headings

Background: the headings of one document need to be copied into another document, but they are scattered throughout the text and copying them by hand is tedious, so docx is used instead.

Open the document and list all of its paragraphs (headings included) to inspect their styles and see which heading levels need to be extracted.

import docx
doc=docx.Document('filename.docx') # open the document

ps=doc.paragraphs
for p in ps:
    print(p.style.name) # print the style name of each paragraph

The output shows that in this document (filename.docx) the headings use the styles Heading 1, Heading 2 and Heading 3 (other documents may use different heading styles). Match these headings through p.style.name and store each heading's text and level in the list re.

re=[]
for p in ps:
    if p.style.name=='Heading 1':
        re.append((p.text,1))
    if p.style.name=='Heading 2':
        re.append((p.text,2))
    if p.style.name=='Heading 3':
        re.append((p.text,3))


Now that the heading text and level have been collected, unzip the re list with titles, titledes = zip(*re): the heading text ends up in titles and the levels in titledes. The headings are then written to a new document.

newdoc=docx.Document()
titles, titledes = zip(*re) # split the (text, level) pairs into two tuples
for i in range(len(titles)):
    newdoc.add_heading(titles[i],level=titledes[i])
newdoc.save('newfile.docx')
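
The three if branches can be folded into a single pattern match so that any heading level is picked up. This is a sketch under the assumption that headings use the built-in "Heading N" style names; the helper name collect_headings is hypothetical.

```python
import re as regex  # aliased because the text above uses `re` as a list name

HEADING = regex.compile(r'^Heading (\d+)$')


def collect_headings(paragraphs):
    """Return [(text, level), ...] for paragraphs styled 'Heading N'."""
    found = []
    for p in paragraphs:
        match = HEADING.match(p.style.name)
        if match:
            found.append((p.text, int(match.group(1))))
    return found


# Usage with python-docx (assumes filename.docx exists):
# import docx
# pairs = collect_headings(docx.Document('filename.docx').paragraphs)
```

Paragraphs whose style name does not match, such as Normal or Title, are skipped.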

Getting the contents of tables

Background: the contents of the second and third columns of every table in the document need to be extracted.

Open the document:

import docx
doc=docx.Document('filename.docx') # open the document

doc.tables returns the tables in the document; the row, column and cell objects are useful when traversing a table.

A Table object has two properties, rows and columns, which behave like a list of Row objects and a list of Column objects. Iteration, taking the length and other list operations therefore work on rows and columns as well.

The Cell is the most commonly used object in a table. A Cell can be obtained in the following five ways:

  • Call the Table object's cell(row, col) method. The top-left cell has the coordinates 0,0.
  • Call the Table object's row_cells(row_index) method to get a list of all the cells in one row, ordered by column.
  • Once you have a Row object, use its Row.cells property to get all the cells in that row, ordered by column.
  • Call the Table object's column_cells(column_index) method to get a list of all the cells in one column, ordered by row.
  • Once you have a Column object, use its Column.cells property to get all the cells in that column, ordered by row.

To traverse every Cell, you can iterate over the rows (table.rows) and then over the cells of each row; alternatively, iterate over the columns (table.columns) and then over the cells of each column.

The most commonly used attribute of a Cell object is text. Setting this attribute sets the contents of the cell; reading it returns the contents of the cell.

For ease of understanding, here is an example:

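for table in doc.tables: # iterate over the tables in the document
    for row in table.rows: # every row of the table
        t1=row.cells[1].text # contents of column 2 in this row (indices start at 0)
        t2=row.cells[2].text # contents of column 3 in this row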

After reading the data from the tables, store it in a DataFrame and finally save it as a CSV file. If Chinese characters come out garbled, add encoding='gb2312' at the end.
# df is the pandas DataFrame that holds the table data
df.to_csv('filename.csv',index=False,encoding='gb2312')
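
The step from table cells to a CSV file is not shown above, so here is a minimal sketch of it. It uses the standard csv module rather than a DataFrame so that it runs without pandas; the helper names table_columns and rows_to_csv and the header names col2 and col3 are hypothetical. With pandas, the same rows could instead be passed to pd.DataFrame(rows, columns=[...]) before calling to_csv.

```python
import csv
import io


def table_columns(tables, columns=(1, 2)):
    """Collect the text of the chosen columns from every row of every table."""
    rows = []
    for table in tables:
        for row in table.rows:
            cells = row.cells
            rows.append([cells[i].text for i in columns])
    return rows


def rows_to_csv(rows, header=('col2', 'col3')):
    """Serialise the rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Usage with python-docx (assumes filename.docx exists):
# import docx
# rows = table_columns(docx.Document('filename.docx').tables)
# with open('filename.csv', 'w', encoding='gb2312', newline='') as f:
#     f.write(rows_to_csv(rows))
```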

Creating a table
The first two parameters of Document.add_table set the number of rows and columns, and the third sets the table style. The style can also be read and set through the table's style attribute. To set a style, use its English name directly, such as "Table Grid"; reading the style returns a Style object, which can be reused across documents. Its name is available through Style.name.

The following creates a table with 6 rows and 2 columns; cells are filled through table.cell(i, j).text.

doc=docx.Document()
table=doc.add_table(rows=6,cols=2,style = 'Table Grid') # solid borders
table.cell(0,0).text='编号' # "Number"
table.cell(1,0).text='位置' # "Location"

The columns of the table created above all have the same width; setting the column widths can make the table look better.

from docx.shared import Inches
for t in doc.tables:
    for row in t.rows:
        row.cells[0].width=Inches(1)
        row.cells[1].width=Inches(5)


Setting Word properties from Python

Office 2007 has no direct menu entry for the VB editor; press Alt + F11 to open it.

import win32com.client      # import the COM client module
WordApp = win32com.client.Dispatch("Word.Application") # start the Word application
WordApp.Visible = True      # show the Word window

1. Creating a new Word document

doc=WordApp.Documents.Add()     # create a new empty document
doc = WordApp.Documents.Open(r"d:\2011专业考试计划.doc") # open the specified document
doc.SaveAs(r"d:\2011专业考试计划.doc")  # save the document
doc.Close(-1)      # save and close; doc.Close() or doc.Close(0) closes without saving

2. Page Setup

doc.PageSetup.PaperSize = 7     # paper size: A3=6, A4=7
doc.PageSetup.PageWidth = 21*28.35    # set the page width directly; this overrides the PaperSize setting
doc.PageSetup.PageHeight = 29.7*28.35        # set the page height directly
doc.PageSetup.Orientation = 1                # page orientation: portrait=0, landscape=1
doc.PageSetup.TopMargin = 3*28.35           # top margin = 3 cm; 1 cm = 28.35 pt
doc.PageSetup.BottomMargin = 3*28.35         # bottom margin = 3 cm
doc.PageSetup.LeftMargin = 2.5*28.35         # left margin = 2.5 cm
doc.PageSetup.RightMargin = 2.5*28.35        # right margin = 2.5 cm
doc.PageSetup.TextColumns.SetCount(2)        # lay the page out in 2 text columns
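
The page setup lines repeat the factor 28.35 to turn centimetres into points. A small helper keeps that conversion in one place; this is only a sketch, and the names cm_to_points and POINTS_PER_CM are hypothetical, not part of win32com or Word.

```python
POINTS_PER_CM = 28.35  # rounded value used in the comments above (72 pt per 2.54 cm)


def cm_to_points(cm):
    """Convert a length in centimetres to points."""
    return cm * POINTS_PER_CM


A4_WIDTH_PT = cm_to_points(21)     # width of an A4 page
A4_HEIGHT_PT = cm_to_points(29.7)  # height of an A4 page

# Usage with the doc object from the text (requires Word and pywin32):
# doc.PageSetup.TopMargin = cm_to_points(3)
# doc.PageSetup.LeftMargin = cm_to_points(2.5)
```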

3. Formatting

sel = WordApp.Selection       # get the Selection object
sel.InsertBreak(8)                # insert a break: column break=8, page break=7
sel.Font.Name = "黑体"                 # font (SimHei)
sel.Font.Size = 24                     # font size
sel.Font.Bold = True                  # bold
sel.Font.Italic = True                 # italic
sel.Font.Underline = True              # underline
sel.ParagraphFormat.LineSpacing = 2*12   # line spacing; 1 line = 12 pt
sel.ParagraphFormat.Alignment = 1      # paragraph alignment: 0=left, 1=center, 2=right
sel.TypeText("XXXX")       # insert text
sel.TypeParagraph()       # insert a paragraph break
Note: a ParagraphFormat property can only be changed again after TypeParagraph() has been called!

4. Insert Picture

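pic = sel.InlineShapes.AddPicture(jpgPathName) # insert a picture (jpgPathName is the image path); inline by default
pic.WrapFormat.Type = 0           # text wrapping: 0=square, 1=tight, 3=in front of text, 5=behind text
pic.Borders.OutsideLineStyle = 1          # border on all 4 sides of the picture; 1=solid line
pic.Borders.OutsideLineWidth = 8          # border width; the dialog values correspond to 2, 4, 6, 8, 12, 18, 24, 36, 48
pic.Borders(-1).LineStyle = 1             # -1=top border, -2=left border, -3=bottom border, -4=right border
pic.Borders(-1).LineWidth = 8             # one of 2, 4, 6, 8, 12, 18, 24, 36, 48
Note: a picture inserted through InlineShapes behaves like an inserted character (inline); a picture inserted through Shapes floats by default.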

5. Insert Table

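tab=doc.Tables.Add(sel.Range, 16, 2)  # add a table with 16 rows and 2 columns
tab.Style = "网格型"       # "Table Grid" style (localized name); shows the table borders
tab.Columns(1).SetWidth(5*28.35, 0)   # set the width of column 1; 1 cm = 28.35 pt
tab.Columns(2).SetWidth(9*28.35, 0)   # set the width of column 2
tab.Rows.Alignment = 1                    # table alignment: 0=left, 1=center, 2=right
tab.Cell(1,1).Range.Text = "xxx"    # fill in a cell; note that Excel uses wSheet.Cells(i,j)
sel.MoveDown(5, 16)       # move down 16 lines; 5 = move by line
Note: after inserting a table with n rows, you must call MoveDown(5, n) to move past the table before doing anything else, otherwise an error is raised!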

6. Using styles

for stl in doc.Styles:
    print(stl.NameLocal)   # print the name of every style in the document

Where to find detailed help for win32com, win32con and win32gui in Python

I used win32com today but could not find any help documentation; after some thought and a Google search I found it.

The installation package includes a .chm help file, which is installed under the Lib\site-packages directory.

python-2.7 - Reading custom document properties of an MS Word file with Python

How can Python read the document properties of an MS Word 2010 document?
By document properties I mean the ones that can be added or modified under FILE -> Info -> Properties -> Advanced Properties (in MS Word 2010).

I use Python 2.7 with the matching pywin32 build on 64-bit Windows 7 to access the doc files ...

I found the CustomProperty object, whose Name and Value members look like what I need ( http://msdn.microsoft.com/en-us/library/bb257518(v=office.12).aspx )

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
doc = word.Documents.Open(file)  # file: full path to the Word document
try:
    csp = doc.CustomDocumentProperties('property_you_want_to_know').value
    print('property is %s' % csp)
except Exception as e:
    print('\n\n', e)

doc.Saved= False
doc.Save()
doc.Close()

word.Quit()

Origin blog.csdn.net/djfjkj52/article/details/91347013