1. How to install and uninstall libraries in Python
1.1 Installation
Search for cmd in the search bar to open a command prompt, then enter pip install followed by the library name. For example, to install python-pptx, enter pip install python-pptx.
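To confirm that an installation worked, one option is to check from Python whether the module can be found. The following is a minimal sketch using only the standard library; is_installed is a hypothetical helper name, not part of pip. Note that the import name can differ from the pip name: python-pptx is imported as pptx and python-docx as docx.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this import name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Import names, not pip names: python-pptx -> pptx, python-docx -> docx
    for name in ("pptx", "docx"):
        print(name, "installed" if is_installed(name) else "missing")
```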
1.2 Uninstall
Search for cmd in the search bar to open a command prompt, then enter pip uninstall followed by the library name. For example, to uninstall python-pptx, enter pip uninstall python-pptx.
2. Tools
I use PyCharm, which I personally find comfortable to work with.
Click here to download the configuration tutorial
3. Convert PPT content to Word
3.1 Convert the text in the PPT text box to Word
3.1.1 Required libraries
Two libraries need to be installed: python-pptx and python-docx.
3.1.2 Implementation code
from pptx import Presentation
from docx import Document
wordfile = Document()
# Path to the PPT file
filepath = r'E:\vs\w.pptx'
pptx = Presentation(filepath)
# Loop over every slide in the PPT file
for slide in pptx.slides:
    # Loop over every shape on the slide
    for shape in slide.shapes:
        # Only shapes that contain a text frame are processed
        if shape.has_text_frame:
            # Get the text frame
            text_frame = shape.text_frame
            # Loop over every paragraph in the text frame
            for paragraph in text_frame.paragraphs:
                # Write the paragraph text into the Word document
                wordfile.add_paragraph(paragraph.text)
# Path where the Word document is saved
save_path = r'E:\vs\w.docx'
wordfile.save(save_path)
3.1.3 Specific explanation
For a detailed explanation, click here or this one.
Note that in the referenced code the * after Document must be removed, otherwise an error is reported, and the line wordfile.add_paragraph(paragraph.text is missing its closing parenthesis ).
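The traversal in 3.1.2 (slides, then shapes, then paragraphs) can also be wrapped in a small generator, which makes it easy to skip empty paragraphs so the Word file does not fill up with blank lines. This is only a sketch: iter_paragraph_texts is a hypothetical helper name, and the stand-in objects built with SimpleNamespace merely imitate the attributes used above (shapes, has_text_frame, text_frame.paragraphs, text) so the example runs without a PPT file.

```python
from types import SimpleNamespace  # only needed for the stand-in demo below


def iter_paragraph_texts(slides):
    """Yield the stripped text of every non-empty paragraph in every text frame."""
    for slide in slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                text = paragraph.text.strip()
                if text:
                    yield text


def _fake_shape(*texts):
    """Build a stand-in shape that has a text frame with the given paragraphs."""
    paragraphs = [SimpleNamespace(text=t) for t in texts]
    return SimpleNamespace(
        has_text_frame=True,
        text_frame=SimpleNamespace(paragraphs=paragraphs),
    )


# One slide with a text shape (one paragraph is empty) and a shape without text
demo_slides = [
    SimpleNamespace(
        shapes=[
            _fake_shape("Title", "", "  Body  "),
            SimpleNamespace(has_text_frame=False),
        ]
    )
]
print(list(iter_paragraph_texts(demo_slides)))  # ['Title', 'Body']
```

With a real file, the same generator would be fed Presentation(filepath).slides, and each yielded string passed to wordfile.add_paragraph.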
3.2 Export the PPT pictures as well
3.2.1 Required libraries
The library that needs to be installed is python-pptx.
3.2.2 Implementation code
Any pictures found in the PPT are saved directly to the current PyCharm project folder.
from pptx import Presentation
from pptx.shapes.picture import Picture
prs = Presentation(r"E:\k\w.pptx")  # path to your PPT file
index = 1
# Read every slide
for slide in prs.slides:
    # Read every shape on the slide
    for shape in slide.shapes:
        # print(dir(shape))
        # Does the shape have a text frame?
        if shape.has_text_frame:
            # Read every paragraph in the text frame
            for paragraph in shape.text_frame.paragraphs:
                if paragraph.text:
                    # Print the paragraph text; use dir() to see its other attributes
                    # print(dir(paragraph))
                    print(paragraph.text)
        # Does the shape contain a table?
        elif shape.has_table:
            one_table_data = []
            for row in shape.table.rows:  # read each row
                row_data = []
                for cell in row.cells:  # read every cell in the row
                    c = cell.text
                    row_data.append(c)
                one_table_data.append(row_data)  # store the row in the table
            # Print the table as a two-dimensional list of rows and columns
            print(one_table_data)
        # Is the shape a picture?
        elif isinstance(shape, Picture):
            # shape.image.blob is the binary image byte stream; write it to an image file
            with open(f'{index}.jpg', 'wb') as f:
                f.write(shape.image.blob)
            index += 1
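The code above always names the file with a .jpg extension, even when the embedded picture is a PNG or some other format. python-pptx exposes the actual extension as shape.image.ext, so one possible variation is to use it in the file name. The sketch below is an assumption about how that could look, not the original author's method; save_picture is a hypothetical helper. Keep in mind that the OCR scripts in section 4 expect files named like 1.jpg, so they would need adjusting if the extensions vary.

```python
from pathlib import Path


def save_picture(blob: bytes, ext: str, index: int, out_dir: str = ".") -> Path:
    """Write image bytes to <out_dir>/<index>.<ext> and return the resulting path."""
    target = Path(out_dir) / f"{index}.{ext}"
    target.write_bytes(blob)
    return target
```

Inside the picture branch this would be called as save_picture(shape.image.blob, shape.image.ext, index).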
4. Use OCR to store image information in Word
4.1 Use Baidu API
4.1.1 Required libraries
Two libraries need to be installed: baidu-aip and python-docx.
4.1.2 Implementation code
# Import the AipOcr class from the aip package
from aip import AipOcr
from docx import Document
wordfile = Document()
# Credentials (replace with the values for your own Baidu application)
APP_ID = "19307867"
API_Key = "HM1UDlzRPrr7TE6xw9YHDSnZ"
Secret_Key = "6jUGVGRLMrbByWz0vPs5w5NOS8m6GMOl"
aipOcr = AipOcr(APP_ID, API_Key, Secret_Key)
# Folder that contains the images
filePath = r"D:\pythonProject"
for i in range(1, 499):
    filePath1 = filePath + "\\" + str(i) + ".jpg"  # preferably jpg, with consistent names
    with open(filePath1, "rb") as f:
        image = f.read()
    # Call the OCR interface
    result = aipOcr.basicGeneral(image)
    # Output; fall back to an empty list if the response has no "words_result" key
    mywords = result.get("words_result", [])
    for item in mywords:
        print(item["words"])
        wordfile.add_paragraph(item["words"])
save_path = r'E:\vs\w.docx'
wordfile.save(save_path)
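The loop above hard-codes range(1, 499), which fails with a FileNotFoundError as soon as one number is missing from the folder. An alternative is to list the numbered .jpg files that actually exist and sort them numerically. This is a sketch under the assumption that the images are named 1.jpg, 2.jpg, and so on, as produced in section 3.2; numbered_jpgs is a hypothetical helper name.

```python
from pathlib import Path


def numbered_jpgs(folder: str) -> list:
    """Return the N.jpg files in folder, sorted by the number N."""
    files = [p for p in Path(folder).glob("*.jpg") if p.stem.isdigit()]
    # Sort numerically so that 10.jpg comes after 2.jpg, not before it
    return sorted(files, key=lambda p: int(p.stem))
```

The main loop could then iterate over numbered_jpgs(filePath) and open each returned path instead of building the file name from a counter.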
4.1.3 Specific explanation
For a detailed explanation, click here.
4.1.4 Existing problems
This method calls the API directly. After many calls in a row the API stops responding properly, and you have to wait a while before it can be called again. This method is therefore not recommended when a large number of images need to be recognized.
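Since waiting a while makes the API usable again, one way to soften the problem is to retry a failed call after a pause. The sketch below rests on an assumption: that a failed call returns a dictionary without the "words_result" key. call_with_retry, attempts and wait_seconds are hypothetical names chosen for illustration, and the wait times would need tuning against the real limits.

```python
import time


def call_with_retry(func, *args, attempts=3, wait_seconds=2.0, sleep=time.sleep):
    """Call func(*args); if the result has no 'words_result', wait and try again."""
    result = {}
    for attempt in range(attempts):
        result = func(*args)
        if "words_result" in result:
            return result
        if attempt < attempts - 1:
            # Wait a little longer after each failed attempt
            sleep(wait_seconds * (attempt + 1))
    # Every attempt failed: hand back the last response so the caller can inspect it
    return result
```

In the script from 4.1.2 the call would become result = call_with_retry(aipOcr.basicGeneral, image). This only helps with short pauses; it does not lift whatever quota the service applies.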
4.2 OCR recognition based on tesseract
4.2.1 Required libraries and software
The libraries that need to be installed are python-pptx, pytesseract and pillow.
The software that needs to be installed is tesseract; for the installation and configuration steps, refer to tutorials by other bloggers.
4.2.2 Implementation code
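from PIL import Image
import pytesseract
import pptx
image = Image.open(r'2.jpg')  # open the image
# Recognize the image and return the result; lang='eng' uses the English language data
# (the Simplified Chinese data, if installed, is selected with lang='chi_sim')
result = pytesseract.image_to_string(image, lang='eng')
print(result)  # print the recognized text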
4.2.3 Specific explanation
Python3 uses pytesseract for image recognition
Python3.6 realizes image conversion
4.2.4 Existing problems
The recognition rate of this method does not meet my requirements, because it cannot reliably recognize the positive and negative numbers in the picture.
4.3 Some additional methods
4.3.1 Libraries to be installed
The libraries that need to be installed are requests and python-docx (the latter is needed to write the result to Word).
4.3.2 Implementation code
import requests
import base64
from docx import Document
wordfile = Document()
def ocr(img_path: str) -> list:
    '''
    Convert the image at img_path to text and return the list of recognized strings
    '''
    # Request headers
    headers = {
        'Host': 'cloud.baidu.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.76',
        'Accept': '*/*',
        'Origin': 'https://cloud.baidu.com',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://cloud.baidu.com/product/ocr/general',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    }
    # Open the image and encode it with base64
    with open(img_path, 'rb') as f:
        img = base64.b64encode(f.read())
    data = {
        'image': 'data:image/jpeg;base64,'+str(img)[2:-1],
        'image_url': '',
        'type': 'https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic',
        'detect_direction': 'false'
    }
    # Call the OCR API
    response = requests.post(
        'https://cloud.baidu.com/aidemo', headers=headers, data=data)
    # Empty list that will hold the recognized strings
    ocr_text = []
    result = response.json()['data']
    if not result.get('words_result'):
        return []
    # Add the recognized strings to the list
    for r in result['words_result']:
        text = r['words'].strip()
        ocr_text.append(" ")
        ocr_text.append(text)
    # add_paragraph expects a string, so join the list before writing it to Word
    wordfile.add_paragraph("".join(ocr_text))
    # Return the list of strings
    return ocr_text
'''
img_path is the path to the image. There are two cases:
Case 1: if your code and the image are in the same folder, the file name is enough, for example test1.jpg (test1.jpg is the image file name)
Case 2: if the full path of the image is D:/img/test1.jpg, then you need to enter D:/img/test1.jpg
'''
for i in range(300, 400):
    img_path10 = str(i) + ".jpg"
    content = "".join(ocr(img_path10))
    print(content)
# img_path = '2.jpg'
# # content is the recognition result
# content = "".join(ocr(img_path))
# # Print the result
# print(content)
save_path = r'E:\vs\前400.docx'
wordfile.save(save_path)
4.3.3 Specific explanation
Use Python to quickly realize image text recognition (30 lines of code)
4.3.4 Existing problems
This method is also essentially a call to an API, so after many calls it stops working, and you have to wait a while before it can be used again.
5. Notes
It is best to select the python.exe interpreter located in the directory where you installed Python. This avoids the problem where you have installed a library but, when you run the code, it reports that the library cannot be found.
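A quick way to see which interpreter a script is actually running under is to print sys.executable and compare it with the python.exe that pip installed the libraries into. This is a small sketch using only the standard library; interpreter_info is a hypothetical helper name.

```python
import sys


def interpreter_info() -> dict:
    """Return the path and version of the interpreter running this script."""
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
    }


print(interpreter_info())
```

If the printed path differs from the Python installation where pip install was run, the library was installed for a different interpreter than the one selected in the project.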
--------------------------------Unfinished below--------------- ---------------
Problems that arise