Converting PPT to Word and running OCR with Python

1. How to install and uninstall libraries in Python

1.1 Installation

Type cmd in the Windows search bar to open a command prompt, then enter pip install followed by the library name. For example, to install python-pptx, enter pip install python-pptx.
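
Note that the name given to pip is not always the name used in an import statement: python-pptx is imported as pptx and python-docx as docx. A minimal sketch for checking whether the libraries can be imported; the helper name is_installed is made up for this example.

```python
import importlib.util

# pip name -> import name (the two differ for these libraries)
IMPORT_NAMES = {
    "python-pptx": "pptx",
    "python-docx": "docx",
}


def is_installed(import_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(import_name) is not None


for pip_name, import_name in IMPORT_NAMES.items():
    status = "installed" if is_installed(import_name) else "missing"
    print(f"{pip_name} (import {import_name}): {status}")
```

If a library shows up as missing here even though pip reported a successful install, the script is probably running under a different interpreter than the one pip installed into.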

1.2 Uninstall

Type cmd in the Windows search bar to open a command prompt, then enter pip uninstall followed by the library name. For example, to uninstall python-pptx, enter pip uninstall python-pptx.

2. Tools

I use PyCharm, which I personally find comfortable to work with.

3. Convert PPT content to Word

3.1 Convert the text in the PPT text box to Word

3.1.1 Required libraries

Two libraries need to be installed: python-pptx and python-docx.

3.1.2 Implementation code

from pptx import Presentation
from docx import Document

wordfile = Document()
# Path of the PPT file
filepath = r'E:\vs\w.pptx'
pptx = Presentation(filepath)
# Loop over every slide in the presentation
for slide in pptx.slides:
    # Loop over every shape on the slide
    for shape in slide.shapes:
        # Only shapes with a text frame contain text
        if shape.has_text_frame:
            # Get the text frame
            text_frame = shape.text_frame
            # Loop over every paragraph in the text frame
            for paragraph in text_frame.paragraphs:
                # Write the paragraph text to the Word document
                wordfile.add_paragraph(paragraph.text)
# Path where the Word document is saved
save_path = r'E:\vs\w.docx'
wordfile.save(save_path)

3.1.3 Specific explanation

The code follows two tutorials linked from the original post.
If you copy the code from those tutorials, note two things: the * after Document must be removed, otherwise an error is raised, and the line wordfile.add_paragraph(paragraph.text is missing its closing ).
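
The loop in 3.1.2 only looks at top-level shapes, so text inside grouped shapes and tables is not copied. Below is a minimal sketch of one way to include it. The helper names iter_texts and pptx_to_docx are made up for this example, and the group detection assumes that only group shapes expose a shapes attribute.

```python
def iter_texts(shapes):
    """Yield non-empty texts from shapes, descending into groups and tables."""
    for shape in shapes:
        if hasattr(shape, "shapes"):
            # Assumption: only group shapes have a .shapes collection
            yield from iter_texts(shape.shapes)
        elif getattr(shape, "has_text_frame", False):
            for paragraph in shape.text_frame.paragraphs:
                if paragraph.text:
                    yield paragraph.text
        elif getattr(shape, "has_table", False):
            for row in shape.table.rows:
                for cell in row.cells:
                    if cell.text:
                        yield cell.text


def pptx_to_docx(pptx_path, docx_path):
    """Copy all slide text into a Word document, one heading per slide."""
    # Imported here so iter_texts can be tried without the libraries installed
    from pptx import Presentation
    from docx import Document

    doc = Document()
    for number, slide in enumerate(Presentation(pptx_path).slides, start=1):
        doc.add_heading(f"Slide {number}", level=2)
        for text in iter_texts(slide.shapes):
            doc.add_paragraph(text)
    doc.save(docx_path)
```

It would be called as pptx_to_docx(r'E:\vs\w.pptx', r'E:\vs\w.docx'). The per-slide heading is an addition that makes it easier to trace a paragraph back to its slide.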

3.2 Export the PPT pictures as well

3.2.1 Required libraries

The library that needs to be downloaded is: python-pptx.

3.2.2 Implementation code

The script prints the text and table contents of each slide. Any pictures it finds are saved directly in the current PyCharm project folder.

from pptx import Presentation
from pptx.shapes.picture import Picture

prs = Presentation(r"E:\k\w.pptx")  # path of your PPT file
index = 1
# Read every slide
for slide in prs.slides:
    # Read every shape on the slide
    for shape in slide.shapes:
        # print(dir(shape))
        # Does the shape have a text frame?
        if shape.has_text_frame:
            # Read every paragraph in the text frame
            for paragraph in shape.text_frame.paragraphs:
                if paragraph.text:
                    # Print the paragraph text; use dir() to see its other attributes
                    # print(dir(paragraph))
                    print(paragraph.text)
        # Does the shape contain a table?
        elif shape.has_table:
            one_table_data = []
            for row in shape.table.rows:  # each row
                row_data = []
                for cell in row.cells:  # each cell in the row
                    c = cell.text
                    row_data.append(c)
                one_table_data.append(row_data)  # store the row in the table
            # Print the table rows and columns as a two-dimensional list
            print(one_table_data)
        # Is the shape a picture?
        elif isinstance(shape, Picture):
            # shape.image.blob holds the raw image bytes; write them to a file
            with open(f'{index}.jpg', 'wb') as f:
                f.write(shape.image.blob)
            index += 1
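
The code above names every picture N.jpg, even when the embedded image is actually a PNG or another format. As a hedged alternative, python-pptx exposes the original extension as image.ext, so the file can be written with its real type. The helper name save_picture is made up for this sketch.

```python
from pathlib import Path


def save_picture(image, index, out_dir="."):
    """Write image.blob to '<index>.<ext>' inside out_dir and return the path."""
    path = Path(out_dir) / f"{index}.{image.ext}"
    path.write_bytes(image.blob)
    return path
```

Inside the isinstance(shape, Picture) branch it would be called as save_picture(shape.image, index). Keep in mind that the OCR scripts in section 4 look for files named N.jpg, so they would need adjusting if other extensions appear.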

4. Use OCR to store image information in Word

4.1 Use Baidu API

4.1.1 Required libraries

Two libraries need to be installed: baidu-aip and python-docx.

4.1.2 Implementation code

# Import AipOcr from the aip package
from aip import AipOcr
from docx import Document


wordfile = Document()
# Credentials for the Baidu OCR service
APP_ID = "19307867"
API_Key = "HM1UDlzRPrr7TE6xw9YHDSnZ"
Secret_Key = "6jUGVGRLMrbByWz0vPs5w5NOS8m6GMOl"
aipOcr = AipOcr(APP_ID, API_Key, Secret_Key)

# Input images
filePath = r"D:\pythonProject"
for i in range(1, 499):
    filePath1 = filePath + "\\" + str(i) + ".jpg"  # preferably jpg, with uniform names
    with open(filePath1, "rb") as f:
        image = f.read()
    # Call the OCR interface
    result = aipOcr.basicGeneral(image)
    # Output; fall back to an empty list if the response has no "words_result" key
    mywords = result.get("words_result", [])
    for item in mywords:
        print(item["words"])
        wordfile.add_paragraph(item["words"])

save_path = r'E:\vs\w.docx'
wordfile.save(save_path)

4.1.3 Specific explanation

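The code creates an AipOcr client from the three credentials, reads each image as bytes, sends it to basicGeneral, and writes every recognized line from words_result to the Word document.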

4.1.4 Existing problems

This approach calls the API directly. After many calls the API stops responding normally, and you have to wait a while before it can be called again. It is therefore not recommended when a large number of images need to be recognized.
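
One way to soften this problem is to wait and retry when a call fails. The sketch below is only an illustration: the name ocr_with_retry and its parameters are made up, and it assumes that a failed call returns a dict without a words_result key.

```python
import time


def ocr_with_retry(recognize, image, retries=3, delay=1.0, sleep=time.sleep):
    """Call recognize(image) until the result contains 'words_result'.

    Waits delay, 2*delay, 4*delay ... seconds between attempts and
    returns an empty list if every attempt fails.
    """
    for attempt in range(retries):
        result = recognize(image)
        if "words_result" in result:
            return [item["words"] for item in result["words_result"]]
        if attempt < retries - 1:
            sleep(delay * 2 ** attempt)  # back off before the next attempt
    return []
```

With the client from 4.1.2 it would be used as lines = ocr_with_retry(aipOcr.basicGeneral, image). Retrying does not raise the quota, so for large batches the wait times may still add up.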

4.2 OCR recognition based on tesseract

4.2.1 Required libraries and software

The libraries that need to be installed are python-pptx, pytesseract and pillow.
The Tesseract software itself also needs to be installed.
For the specific installation and configuration steps, refer to other bloggers' guides.

4.2.2 Implementation code

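from PIL import Image
import pytesseract
import pptx

image = Image.open(r'2.jpg')  # open the image
# lang='eng' uses the English model; for Simplified Chinese use lang='chi_sim'
# (the matching Tesseract language data must be installed)
result = pytesseract.image_to_string(image, lang='eng')
print(result)  # print the recognized text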

4.2.3 Specific explanation

See the articles "Python3 uses pytesseract for image recognition" and "Python3.6 realizes image conversion".

4.2.4 Existing problems

The recognition rate of this method did not meet my requirements, because it does not reliably distinguish positive and negative numbers in the picture.
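
One thing that may be worth trying for numeric content is to restrict Tesseract to digits and sign characters and to binarize the image first. This is an untested sketch, not a confirmed fix: the function names and the threshold value are made up, and how well tessedit_char_whitelist is honored depends on the installed Tesseract version.

```python
def numeric_config(psm=6):
    """Build a Tesseract config string that allows only digits, signs and '.'."""
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789+-."


def read_numbers(image_path, threshold=150):
    """OCR an image after grayscale conversion and simple thresholding."""
    # Imported here so numeric_config can be used without the OCR stack installed
    from PIL import Image
    import pytesseract

    image = Image.open(image_path).convert("L")  # grayscale
    image = image.point(lambda p: 255 if p > threshold else 0)  # black and white
    return pytesseract.image_to_string(image, lang="eng", config=numeric_config())
```

It would be called as print(read_numbers('2.jpg')). The page segmentation mode 6 treats the image as a single uniform block of text; other modes may suit other layouts better.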

4.3 Some additional methods

4.3.1 Libraries to be installed

The libraries that need to be installed are requests and python-docx (the latter is needed to write the result to Word).

4.3.2 Implementation code

import requests
import base64
from docx import Document


wordfile = Document()

def ocr(img_path: str) -> list:
    '''
    Convert the image at img_path to text and return the list of recognized strings.
    '''
    # Request headers
    headers = {
        'Host': 'cloud.baidu.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.76',
        'Accept': '*/*',
        'Origin': 'https://cloud.baidu.com',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://cloud.baidu.com/product/ocr/general',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    }
    # Open the image and base64-encode it
    with open(img_path, 'rb') as f:
        img = base64.b64encode(f.read())
    data = {
        'image': 'data:image/jpeg;base64,' + img.decode('ascii'),
        'image_url': '',
        'type': 'https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic',
        'detect_direction': 'false'
    }
    # Call the OCR API
    response = requests.post(
        'https://cloud.baidu.com/aidemo', headers=headers, data=data)

    # Empty list that will hold the recognized strings
    ocr_text = []
    result = response.json()['data']
    if not result.get('words_result'):
        return []

    # Append each recognized string to the list
    for r in result['words_result']:
        text = r['words'].strip()
        ocr_text.append("  ")
        ocr_text.append(text)
    # add_paragraph expects a string, so join the list first
    wordfile.add_paragraph("".join(ocr_text))
    # Return the list of strings
    return ocr_text


'''
img_path is the path of the image. There are two cases:
1. If your code and the image are in the same folder, the file name is enough, for example test1.jpg (test1.jpg is the image file name).
2. If the full path of the image is D:/img/test1.jpg, pass D:/img/test1.jpg.
'''

for i in range(300, 400):
    img_path10 = str(i) + ".jpg"
    content = "".join(ocr(img_path10))
    print(content)

# img_path = '2.jpg'
# # content is the recognized result
# content = "".join(ocr(img_path))
# # Print the result
# print(content)
save_path = r'E:\vs\前400.docx'
wordfile.save(save_path)
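
The loop above assumes that every file from 300.jpg to 399.jpg exists; a single missing file stops the script with an error when it is opened. A small sketch that collects only the files that are actually present (the helper name numbered_images is made up for this example):

```python
from pathlib import Path


def numbered_images(folder, start, stop, suffix=".jpg"):
    """Return the paths folder/<n><suffix> for start <= n < stop that exist."""
    folder = Path(folder)
    candidates = (folder / f"{n}{suffix}" for n in range(start, stop))
    return [path for path in candidates if path.is_file()]
```

The loop could then be written as for path in numbered_images(".", 300, 400): print("".join(ocr(str(path)))).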

4.3.3 Specific explanation

See the article "Use Python to quickly realize image text recognition (30 lines of code)".

4.3.4 Existing problems

This method is also essentially a call to the API, so it stops working after many calls and you have to wait a while before it can be used again.

5. Notes

In PyCharm, it is best to select the python.exe in the directory where you installed Python as the project interpreter. This avoids the problem where you have installed a library but, when you run the code, it is reported as not found.
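
To see which interpreter a script really runs under, and to install a library into exactly that interpreter, something like the following can help. It is a small sketch; the helper name pip_command is made up.

```python
import sys


def pip_command(action, package):
    """Build a pip command line bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", action, package]


print("Interpreter in use:", sys.executable)
print("Install with:", " ".join(pip_command("install", "python-docx")))
```

Running the printed command in cmd installs the library for the same python.exe that PyCharm uses for the script.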
-------------------------------- Unfinished below --------------------------------
Problems that arise

Solution for when a package installed with pip in the terminal is still unavailable in PyCharm

Origin blog.csdn.net/weixin_52296952/article/details/123625760