20230507: Batch-converting DOCX documents to TXT with Python 3

2023/5/7 20:22

Environment: Windows 10 with Python 3.11

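# -*- coding: gbk -*-
# The declaration above is needed because this file is saved as GBK and
# contains Chinese string literals (see the SyntaxError in the log below).
import os
from pdf2docx import Converter  # not used by this script; can be removed
from win32com import client as wc
# win32com comes with the pywin32 package ("pip install pypiwin32" pulls it in).


def DocxToTxt(inputFinallyPath, outputFinallyPath):
    """Convert one .docx file to .txt by driving Microsoft Word over COM."""
    wordhandle = wc.Dispatch("Word.Application")
    wordhandle.Visible = 0  # run in the background, no window
    wordhandle.DisplayAlerts = 0  # suppress alert dialogs
    try:
        doc = wordhandle.Documents.Open(inputFinallyPath)
        # WdSaveFormat values: txt=4, html=10, docx=16, pdf=17
        doc.SaveAs(outputFinallyPath, 4)
        doc.Close()  # parentheses are required; a bare doc.Close is never called
    finally:
        # Quit Word so no hidden WINWORD.EXE is left running.
        wordhandle.Quit()


if __name__ == '__main__':
    # Input folder
    inputPath = r'D:\pythonproject\pdf_to_txt\input'
    # Output folder; an absolute path is recommended
    outputPath = r'D:\pythonproject\pdf_to_txt\output'

    # List the input folder, keeping only .docx files
    docxList = [f for f in os.listdir(inputPath) if f.lower().endswith('.docx')]
    # Convert the files one by one
    docx_num = 0
    for li in docxList:
        print(li)
        inputFinallyPath = os.path.join(inputPath, li)
        outputFinallyPath = os.path.join(outputPath, os.path.splitext(li)[0] + '.txt')
        DocxToTxt(inputFinallyPath, outputFinallyPath)
        docx_num = docx_num + 1
        print('第 %d 篇docx已转换为txt' % docx_num)  # "docx no. %d converted to txt"
    print('共计%d篇docx文章已完全转换为txt' % docx_num)  # "%d docx files converted in total"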
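
The batch loop above comes down to pairing each input file with an output path. Below is a minimal sketch of that step using pathlib. The helper name `plan_conversions` is hypothetical, and skipping Word's `~$` lock files is an added precaution that the original script does not take.

```python
from pathlib import Path


def plan_conversions(input_dir, output_dir):
    """Return (source, target) path pairs for every .docx file in input_dir.

    plan_conversions is a hypothetical helper, not part of the original script.
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    pairs = []
    for src in sorted(input_dir.iterdir()):
        # Keep regular .docx files only.
        if not src.is_file() or src.suffix.lower() != ".docx":
            continue
        # "~$name.docx" is a lock file Word creates while a document is open.
        if src.name.startswith("~$"):
            continue
        # with_suffix swaps only the last extension, so
        # "SIVR-001.google.docx" becomes "SIVR-001.google.txt".
        pairs.append((src, output_dir / src.with_suffix(".txt").name))
    return pairs
```

Each pair could then be handed to the converter as `DocxToTxt(str(src), str(dst))`.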


Google Translate was used to translate 88 Japanese DOCX subtitle files into Simplified Chinese.
Microsoft Windows [Version 10.0.19044.2728]
(c) Microsoft Corporation. All rights reserved.

C:\Users\QQ>python3

C:\Users\QQ>python

C:\Users\QQ>python
Python 3.11.3 (tags/v3.11.3:f3909b8, Apr  4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from pdf2docx import Converter
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pdf2docx'
>>>
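
The traceback means that the interpreter that was started cannot find pdf2docx. Below is a small sketch for checking this before running the script. The helper name `missing_modules` is hypothetical; it relies on `importlib.util.find_spec` from the standard library and only looks at top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The top-level modules imported by the script above.
    print(missing_modules(["os", "pdf2docx", "win32com"]))
```

An empty list means every listed module can be found by the current interpreter.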


C:\Users\QQ>pip install pdf2docx
Collecting pdf2docx
  Downloading pdf2docx-0.5.6-py3-none-any.whl (148 kB)
     ---------------------------------------- 148.4/148.4 kB 368.3 kB/s eta 0:00:00
Collecting PyMuPDF>=1.19.0
  Downloading PyMuPDF-1.22.2-cp311-cp311-win_amd64.whl (11.7 MB)
     ---------------------------------------- 11.7/11.7 MB 12.8 MB/s eta 0:00:00
Collecting python-docx>=0.8.10
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
     ---------------------------------------- 5.6/5.6 MB 1.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting fonttools>=4.24.0
  Downloading fonttools-4.39.3-py3-none-any.whl (1.0 MB)
     ---------------------------------------- 1.0/1.0 MB 12.8 MB/s eta 0:00:00
Collecting numpy>=1.17.2
  Downloading numpy-1.24.3-cp311-cp311-win_amd64.whl (14.8 MB)
     ---------------------------------------- 14.8/14.8 MB 21.1 MB/s eta 0:00:00
Collecting opencv-python>=4.5
  Downloading opencv_python-4.7.0.72-cp37-abi3-win_amd64.whl (38.2 MB)
     ---------------------------------------- 38.2/38.2 MB 12.6 MB/s eta 0:00:00
Collecting fire>=0.3.0
  Downloading fire-0.5.0.tar.gz (88 kB)
     ---------------------------------------- 88.3/88.3 kB 4.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting six
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting termcolor
  Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Collecting lxml>=2.3.2
  Downloading lxml-4.9.2-cp311-cp311-win_amd64.whl (3.8 MB)
     ---------------------------------------- 3.8/3.8 MB 10.0 MB/s eta 0:00:00
Installing collected packages: termcolor, six, PyMuPDF, numpy, lxml, fonttools, python-docx, opencv-python, fire, pdf2docx
  WARNING: The script f2py.exe is installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts fonttools.exe, pyftmerge.exe, pyftsubset.exe and ttx.exe are installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  DEPRECATION: python-docx is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for python-docx ... done
  DEPRECATION: fire is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for fire ... done
  WARNING: The script pdf2docx.exe is installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed PyMuPDF-1.22.2 fire-0.5.0 fonttools-4.39.3 lxml-4.9.2 numpy-1.24.3 opencv-python-4.7.0.72 pdf2docx-0.5.6 python-docx-0.8.11 six-1.16.0 termcolor-2.3.0

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

C:\Users\QQ>

 


C:\Users\QQ>pip install win32com
ERROR: Could not find a version that satisfies the requirement win32com (from versions: none)
ERROR: No matching distribution found for win32com

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

C:\Users\QQ>
C:\Users\QQ>pip install pypwin32
ERROR: Could not find a version that satisfies the requirement pypwin32 (from versions: none)
ERROR: No matching distribution found for pypwin32

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

C:\Users\QQ>
C:\Users\QQ>pip install  pypiwin32
Collecting pypiwin32
  Downloading pypiwin32-223-py3-none-any.whl (1.7 kB)
Collecting pywin32>=223
  Downloading pywin32-306-cp311-cp311-win_amd64.whl (9.2 MB)
     ---------------------------------------- 9.2/9.2 MB 895.2 kB/s eta 0:00:00
Installing collected packages: pywin32, pypiwin32
Successfully installed pypiwin32-223 pywin32-306

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

C:\Users\QQ>
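
The two failed attempts happen because the import name differs from the package name on PyPI: win32com ships inside pywin32, which the pypiwin32 install above pulls in as a dependency. Below is a hedged sketch that maps import names to package names and builds a pip command. `PIP_NAMES` and `pip_install_command` are hypothetical names, and the table only lists the two packages seen in this log. Calling pip as `python -m pip` also ties the install to one specific interpreter.

```python
import sys

# Import name -> name to pass to pip. Only the packages from this log are
# listed; extend the table as needed.
PIP_NAMES = {
    "win32com": "pywin32",
    "pdf2docx": "pdf2docx",
}


def pip_install_command(import_name):
    """Build (but do not run) a pip command for the given import name."""
    package = PIP_NAMES.get(import_name, import_name)
    # sys.executable is the interpreter that is running this code.
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    print(pip_install_command("win32com"))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to perform the install.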
C:\Users\QQ>d:

D:\>dir *.pty
 Volume in drive D is DATA
 Volume Serial Number is 547F-1046

 Directory of D:\

File Not Found

D:\>dir *.py
 Volume in drive D is DATA
 Volume Serial Number is 547F-1046

 Directory of D:\

2023/05/07  19:55             1,221 pdf2doc2.py
               1 File(s)          1,221 bytes
               0 Dir(s)  195,912,142,848 bytes free

D:\>python pdf2doc2.py
SyntaxError: Non-UTF-8 code starting with '\xd5' in file D:\pdf2doc2.py on line 4, but no encoding declared; see https://peps.python.org/pep-0263/ for details

D:\>
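
Python 3 reads source files as UTF-8 unless a coding declaration says otherwise, so a script saved as GBK fails at its first Chinese character. The byte 0xd5 in the message is the first byte of "这" in GBK, which presumably matches the Chinese note on line 4 of the script at that point. Besides adding `# -*- coding: gbk -*-`, another option is to re-save the file as UTF-8. Below is a minimal sketch of that, assuming the file really is GBK; `reencode_to_utf8` is a hypothetical helper.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="gbk"):
    """Rewrite a text file in place as UTF-8, decoding it as source_encoding."""
    path = Path(path)
    # Work on raw bytes so the original line endings are preserved.
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    raw = "这".encode("gbk")
    print(raw)  # the first byte is 0xd5, as in the error message
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("not valid UTF-8:", exc.reason)
```

After re-encoding, the coding declaration is no longer needed, because UTF-8 is the default source encoding.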


D:\>python pdf2doc2.py
  File "D:\pdf2doc2.py", line 36
    print('共计%d篇docx文章已完全转换为txt' pdf_num-1))
                                           ^
SyntaxError: unmatched ')'
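
The line in this traceback is missing the `%` operator between the format string and its argument, so the final closing parenthesis has nothing to match. Below is a short sketch of the corrected form next to an f-string equivalent. The function names are hypothetical, and the English message stands in for the original Chinese string.

```python
def summary_percent(pdf_num):
    """printf-style formatting, as used in the script."""
    # The parentheses around the argument matter: '%d' % pdf_num - 1 would be
    # parsed as ('%d' % pdf_num) - 1 and raise a TypeError.
    return 'total: %d docx files converted to txt' % (pdf_num - 1)


def summary_fstring(pdf_num):
    """The same message built with an f-string."""
    return f'total: {pdf_num - 1} docx files converted to txt'
```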

D:\>python pdf2doc2.py
MIDE-599.google.docx
第 1 篇docx已转换为txt
OAE-101.google.docx
第 2 篇docx已转换为txt
OAE-165.google.docx
第 3 篇docx已转换为txt
OFJE-139 1.google.docx
第 4 篇docx已转换为txt
OFJE-139 2.google.docx
第 5 篇docx已转换为txt
OFJE-189.google.docx
第 6 篇docx已转换为txt
OFJE-236.google.docx
第 7 篇docx已转换为txt
pSSNI-473.google.docx
第 8 篇docx已转换为txt
SIVR-001.google.docx
第 9 篇docx已转换为txt
SIVR-002.google.docx
第 10 篇docx已转换为txt
SIVR-003.google.docx
第 11 篇docx已转换为txt
SIVR-012 1.google.docx
第 12 篇docx已转换为txt
SIVR-012 2.google.docx
第 13 篇docx已转换为txt
SIVR-015 1.google.docx
第 14 篇docx已转换为txt
SIVR-015 2.google.docx
第 15 篇docx已转换为txt
SIVR-016 1.google.docx
第 16 篇docx已转换为txt
SIVR-016 2.google.docx
第 17 篇docx已转换为txt
SIVR-017 1.google.docx
第 18 篇docx已转换为txt
SIVR-017 2.google.docx
第 19 篇docx已转换为txt
SIVR-017 3.google.docx
第 20 篇docx已转换为txt
SIVR-033 1.google.docx
第 21 篇docx已转换为txt
SIVR-033 2.google.docx
第 22 篇docx已转换为txt
SIVR-033 3.google.docx
第 23 篇docx已转换为txt
SIVR-033 4.google.docx
第 24 篇docx已转换为txt
SIVR-033 5.google.docx
第 25 篇docx已转换为txt
SIVR-033 6.google.docx
第 26 篇docx已转换为txt
SIVR-034 1.google.docx
第 27 篇docx已转换为txt
SIVR-034 2.google.docx
第 28 篇docx已转换为txt
SIVR-034 3.google.docx
第 29 篇docx已转换为txt
SIVR-044 1.google.docx
第 30 篇docx已转换为txt
SIVR-044 2.google.docx
第 31 篇docx已转换为txt
SIVR-061 1.google.docx
第 32 篇docx已转换为txt
SIVR-061 2.google.docx
第 33 篇docx已转换为txt
SIVR-061 3.google.docx
第 34 篇docx已转换为txt
SIVR-061 4.google.docx
第 35 篇docx已转换为txt
SIVR-067 1.google.docx
第 36 篇docx已转换为txt
SIVR-067 2.google.docx
第 37 篇docx已转换为txt
SIVR-067 3.google.docx
第 38 篇docx已转换为txt
SNIS-786.google.docx
第 39 篇docx已转换为txt
SNIS-800.google.docx
第 40 篇docx已转换为txt
SNIS-850 1.google.docx
第 41 篇docx已转换为txt
SNIS-850 2.google.docx
第 42 篇docx已转换为txt
SNIS-872.google.docx
第 43 篇docx已转换为txt
SNIS-896.google.docx
第 44 篇docx已转换为txt
SNIS-919.google.docx
第 45 篇docx已转换为txt
SNIS-964.google.docx
第 46 篇docx已转换为txt
SNIS-964.google2.docx
第 47 篇docx已转换为txt
SNIS-986.google.docx
第 48 篇docx已转换为txt
SSNI-009.google.docx
第 49 篇docx已转换为txt
SSNI-030.google.docx
第 50 篇docx已转换为txt
SSNI-054.google.docx
第 51 篇docx已转换为txt
SSNI-077.google.docx
第 52 篇docx已转换为txt
SSNI-101.google.docx
第 53 篇docx已转换为txt
SSNI-127.google.docx
第 54 篇docx已转换为txt
SSNI-152.google.docx
第 55 篇docx已转换为txt
SSNI-178.google.docx
第 56 篇docx已转换为txt
SSNI-205.google.docx
第 57 篇docx已转换为txt
SSNI-229.google.docx
第 58 篇docx已转换为txt
SSNI-254.google.docx
第 59 篇docx已转换为txt
SSNI-279.google.docx
第 60 篇docx已转换为txt
SSNI-301.google.docx
第 61 篇docx已转换为txt
SSNI-322.google.docx
第 62 篇docx已转换为txt
SSNI-344.google.docx
第 63 篇docx已转换为txt
SSNI-388.google.docx
第 64 篇docx已转换为txt
SSNI-409.google.docx
第 65 篇docx已转换为txt
SSNI-432.google.docx
第 66 篇docx已转换为txt
SSNI-452.google.docx
第 67 篇docx已转换为txt
SSNI-473.google.docx
第 68 篇docx已转换为txt
SSNI-493.google.docx
第 69 篇docx已转换为txt
SSNI-516.google.docx
第 70 篇docx已转换为txt
SSNI-542.google.docx
第 71 篇docx已转换为txt
SSNI-566.google.docx
第 72 篇docx已转换为txt
SSNI-589.google.docx
第 73 篇docx已转换为txt
SSNI-618.google.docx
第 74 篇docx已转换为txt
SSNI-644.google.docx
第 75 篇docx已转换为txt
SSNI-674.google.docx
第 76 篇docx已转换为txt
SSNI-703.google.docx
第 77 篇docx已转换为txt
SSNI-730.google.docx
第 78 篇docx已转换为txt
TEK-067.google.docx
第 79 篇docx已转换为txt
TEK-071.google.docx
第 80 篇docx已转换为txt
TEK-072.google.docx
第 81 篇docx已转换为txt
TEK-073.google.docx
第 82 篇docx已转换为txt
TEK-076.google.docx
第 83 篇docx已转换为txt
TEK-079只有音频.google.docx
第 84 篇docx已转换为txt
TEK-080.google.docx
第 85 篇docx已转换为txt
TEK-081只有音频.google.docx
第 86 篇docx已转换为txt
TEK-083只有音频.google.docx
第 87 篇docx已转换为txt
TEK-097.google.docx
第 88 篇docx已转换为txt

D:\>
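
With 88 files in the run above, a quick check that every .docx has a matching .txt can catch files that were silently skipped. Below is a minimal sketch, assuming the output files keep the base name of the input as in the script; `unconverted` is a hypothetical helper.

```python
from pathlib import Path


def unconverted(input_dir, output_dir):
    """Return names of .docx files in input_dir with no matching .txt in output_dir."""
    output_dir = Path(output_dir)
    missing = []
    for src in sorted(Path(input_dir).glob("*.docx")):
        # The expected output name swaps only the final extension.
        if not (output_dir / src.with_suffix(".txt").name).exists():
            missing.append(src.name)
    return missing
```

An empty result means each input file has a counterpart in the output folder.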


References:
python batch convert DOCX TXT

https://blog.csdn.net/weixin_46255747/article/details/129961988
Batch-converting docx to txt with Python

ModuleNotFoundError: No module named 'pdf2docx'

python win32com pip install

https://blog.csdn.net/qq_45662588/article/details/130315080
Fixing the win32com installation on Python 3.9

https://blog.csdn.net/longe20111104/article/details/129754624
Fix for the "pip install win32com" error
pip install  pypiwin32

SyntaxError: Non-UTF-8 code starting with '\xd5' in file D:\pdf2doc2.py on line 4, but no encoding d

https://blog.csdn.net/coco_apple/article/details/113437552
SyntaxError: Non-UTF-8 code starting with '\xd5' in file
# -*- coding: gbk -*-

 

 

 

Reposted from blog.csdn.net/wb4916/article/details/130547425