Reading xls, xlsx, csv, doc, docx, and pdf files with Python

Foreword

This article shows how to read xls, xlsx, csv, doc, docx, and pdf files with Python.
The examples were written for Python 3.10.4.

Read xls

Install the library: pip install xlrd==2.0.1
WPS .et files can also be read this way.

import xlrd

wb = xlrd.open_workbook(path)

# Iterate over all worksheets
for index, name in enumerate(wb.sheet_names()):
    sheet = wb.sheet_by_index(index)

    # Total number of rows and columns in the worksheet
    rows = sheet.nrows
    cols = sheet.ncols

    # Read each cell by (row, column); xlrd indexes start at 0
    for r in range(rows):
        for c in range(cols):
            value = sheet.cell(r, c).value
            # Skip empty cells (this truthiness test also skips 0)
            if value:
                print(value)
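
The truthiness check in the loop above drops not only empty cells but also cells that hold 0 (xlrd returns numbers as floats, so 0.0). If those cells matter, a stricter blank test can be used. The following is a minimal sketch; `is_blank` and `non_blank_values` are hypothetical helper names, not part of xlrd, and the sample rows are made up.

```python
def is_blank(value):
    """Return True only for cells with no content: None or whitespace-only text."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    return False


def non_blank_values(rows):
    """Yield every non-blank value from an iterable of row lists."""
    for row in rows:
        for value in row:
            if not is_blank(value):
                yield value


# In-memory rows standing in for sheet data (illustrative values only)
sample_rows = [["name", "count"], ["apples", 0.0], ["", None]]
print(list(non_blank_values(sample_rows)))
```

With xlrd, the rows could be supplied as `(sheet.row_values(r) for r in range(sheet.nrows))`.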

Read xlsx

Install the library: pip install openpyxl==3.1.1

import openpyxl

# Load the workbook
wb = openpyxl.load_workbook(path)

# Iterate over all worksheets
for name in wb.sheetnames:
    sheet = wb[name]

    # Total number of rows and columns in the worksheet
    rows = sheet.max_row
    cols = sheet.max_column

    # Read each cell by (row, column); openpyxl indexes start at 1
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            value = sheet.cell(row=r, column=c).value
            # Skip empty cells (this truthiness test also skips 0 and False)
            if value:
                print(value)

wb.close()
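
When the first row of a worksheet is a header, it is often more convenient to work with one dictionary per data row than with individual cells. The following is a minimal sketch under that assumption; `rows_to_records` is a hypothetical helper that accepts any iterable of row tuples, such as the one produced by openpyxl's `sheet.iter_rows(values_only=True)`.

```python
def rows_to_records(rows):
    """Turn an iterable of row tuples into a list of dicts keyed by the first row."""
    iterator = iter(rows)
    try:
        header = next(iterator)
    except StopIteration:
        return []  # empty sheet
    records = []
    for row in iterator:
        # Skip rows in which every cell is empty
        if all(cell is None for cell in row):
            continue
        records.append(dict(zip(header, row)))
    return records


# Row tuples in the shape produced by sheet.iter_rows(values_only=True)
sample = [("name", "qty"), ("pen", 3), (None, None), ("ink", 0)]
print(rows_to_records(sample))
```

Inside the worksheet loop this could be called as `rows_to_records(sheet.iter_rows(values_only=True))`.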

Read docx

Install the library: pip install python-docx

from docx import Document

document = Document(path)
for paragraph in document.paragraphs:
    print(paragraph.text)
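
A .docx file is a ZIP archive whose main body is stored as XML in `word/document.xml`. If installing python-docx is not an option, paragraph text can be pulled out with the standard library alone. This is a rough sketch rather than a replacement for python-docx: it walks every paragraph element in the body part (including those inside tables) and ignores headers, footers, and footnotes, which live in other parts. `docx_paragraph_texts` is a hypothetical name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraph_texts(source):
    """Return the text of each <w:p> paragraph; source is a path or binary file object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back together
        texts.append("".join(node.text or "" for node in paragraph.iter(W_NS + "t")))
    return texts


# Build a tiny in-memory archive containing only the part this sketch reads
# (not a complete .docx; Word would not open it)
xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second</w:t></w:r></w:p></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", xml)
buffer.seek(0)
print(docx_paragraph_texts(buffer))
```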

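Read doc

A .doc file has to be converted to .docx first and is then read as described in the docx section. The conversion below drives Microsoft Word through COM, so it requires Windows with Word installed and the pywin32 package (pip install pywin32).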

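import os

import pythoncom
import win32com.client as wc

# Initialize COM for this thread; avoids the "CoInitialize has not been called" error
pythoncom.CoInitialize()
word = wc.Dispatch("Word.Application")

# Keep Word hidden while saving the doc as docx; otherwise the Word window is shown
word.Visible = False

# Spaces in the file name cause an error, so rename the file without them
clean_path = path.replace(" ", "")
os.rename(path, clean_path)

doc = word.Documents.Open(clean_path)
# 12 is wdFormatXMLDocument, i.e. the .docx format
doc.SaveAs(clean_path[:-4] + ".docx", 12)
doc.Close()

word.Quit()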
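
Slicing off the last four characters assumes a `.doc` suffix, and removing spaces from the whole path string also changes directory names, which can point at a folder that does not exist. A sketch of deriving both paths with pathlib so that only the file name is touched; `conversion_paths` is a hypothetical helper, the example path is made up, and the function performs no file operations itself.

```python
from pathlib import PureWindowsPath


def conversion_paths(path):
    """Return (space-free source path, .docx target path) for a Windows .doc path."""
    source = PureWindowsPath(path)
    # Remove spaces from the file name only; directory names are left untouched
    clean_source = source.with_name(source.name.replace(" ", ""))
    # Swap the extension regardless of its length
    target = clean_source.with_suffix(".docx")
    return str(clean_source), str(target)


# Hypothetical example path
print(conversion_paths(r"C:\My Docs\annual report.doc"))
```

The first string would be used for the rename and for opening the document, the second as the save target.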

Read pdf

Install the library: pip install pdfplumber==0.7.3

import pdfplumber

with pdfplumber.open(path) as pdf:
    # Print the extracted text of each page
    for page in pdf.pages:
        print(page.extract_text())
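
Scanned or image-only pages have no text layer, and depending on the pdfplumber version `extract_text()` may then return `None` or an empty string. A small sketch of joining page texts while tolerating such pages; `join_page_texts` is a hypothetical helper that takes the per-page results rather than a PDF object, and the sample values are made up.

```python
def join_page_texts(page_texts, separator="\n\n"):
    """Join per-page text, skipping pages that produced None or only whitespace."""
    kept = []
    for text in page_texts:
        if text and text.strip():
            kept.append(text.strip())
    return separator.join(kept)


# Stand-in values for what page.extract_text() might return on three pages
print(join_page_texts(["Page one", None, "  Page three \n"]))
```

Inside the `with` block it could be called as `join_page_texts(page.extract_text() for page in pdf.pages)`.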

Read csv

import csv

# newline="" is recommended by the csv module documentation
with open(path, "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
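
For files with a header row, `csv.DictReader` maps each row to a dictionary keyed by column name. Files that Excel saves as "CSV UTF-8" start with a byte order mark, which the `utf-8-sig` codec strips when the file is opened. A minimal sketch; `read_csv_records` is a hypothetical helper, and UTF-8 is an assumption about the input file's encoding.

```python
import csv
import io


def read_csv_records(stream):
    """Read CSV text from an open text stream into a list of dicts keyed by the header."""
    return list(csv.DictReader(stream))


# In-memory text standing in for a file opened with
# open(path, newline="", encoding="utf-8-sig")
sample = io.StringIO("name,qty\npen,3\nink,0\n", newline="")
print(read_csv_records(sample))
```

All values come back as strings, so numeric columns still need an explicit conversion such as `int(record["qty"])`.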



Origin blog.csdn.net/oldfish_/article/details/129101679