PyPDF2 Chinese settings
PyPDF2 encodes strings as Latin-1 by default, so it raises an error when processing documents that contain Chinese text.
The instructions in this article apply to both Linux and Windows (tested).
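Quick method (overwrite the files):
Download the configuration files generic.py and utils.py, then copy them into the PyPDF2 package directory, overwriting the existing files. On Windows this is a directory such as C:\Users\currentuser\AppData\Local\Programs\Python\Python39\Lib\site-packages\PyPDF2
On Linux, locate the corresponding PyPDF2 directory in your Python installation.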
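If the package directory is not obvious, it can be looked up from Python itself. The following is a minimal sketch that works the same way on Linux and Windows; the helper name package_dir is made up for this illustration and is not part of PyPDF2.

```python
import importlib.util
import os


def package_dir(package_name):
    """Return the install directory of a package, or None if it is not found.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    # A regular package has exactly one search location: its own directory.
    return os.path.abspath(list(spec.submodule_search_locations)[0])


# Prints the directory that holds generic.py and utils.py,
# or None if PyPDF2 is not installed in this interpreter.
print(package_dir("PyPDF2"))
```

Run it with the same interpreter (or virtual environment) that your scripts use, since each environment has its own site-packages directory.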
Custom method (modify the files yourself):
In utils.py, at around line 240, find:
r = s.encode('latin-1')
if len(s) < 2:
    bc[s] = r
return r
Change it to:
r = s.encode('utf-8')
if len(s) < 2:
    bc[s] = r
return r
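The reason for this change can be checked in isolation, without PyPDF2. The sketch below uses a simplified stand-in for the encode step (the function name encode_text is an assumption for illustration, not PyPDF2's own code) to show that Latin-1 cannot represent Chinese characters while UTF-8 can.

```python
def encode_text(s, encoding):
    """Simplified stand-in for the encode step; not PyPDF2's actual function."""
    return s.encode(encoding)


sample = "中文"

# Latin-1 only covers code points 0-255, so Chinese characters fail.
try:
    encode_text(sample, "latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can encode any Unicode string; each of these characters takes 3 bytes.
utf8_bytes = encode_text(sample, "utf-8")

print(latin1_ok)        # False
print(len(utf8_bytes))  # 6
```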
In generic.py, at around line 480, find:
try:
    return NameObject(name.decode('utf-8'))
except (UnicodeEncodeError, UnicodeDecodeError) as e:
    # Name objects should represent irregular characters
    # with a '#' followed by the symbol's hex number
    if not pdf.strict:
        warnings.warn("Illegal character in Name Object", utils.PdfReadWarning)
        return NameObject(name)
    else:
        raise utils.PdfReadError("Illegal character in Name Object")
Change it to:
try:
    return NameObject(name.decode('utf-8'))
except (UnicodeEncodeError, UnicodeDecodeError) as e:
    try:
        return NameObject(name.decode('gbk'))
    except (UnicodeEncodeError, UnicodeDecodeError) as e:
        # Name objects should represent irregular characters
        # with a '#' followed by the symbol's hex number
        if not pdf.strict:
            warnings.warn("Illegal character in Name Object", utils.PdfReadWarning)
            return NameObject(name)
        else:
            raise utils.PdfReadError("Illegal character in Name Object")
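The fallback order in this patch (UTF-8 first, then GBK, then the original error handling) can be tried out on its own. The following is a rough standalone sketch of the same idea; decode_name and its strict parameter are hypothetical names for this illustration, and it returns plain strings rather than PyPDF2 NameObject instances.

```python
import warnings


def decode_name(raw, strict=False):
    """Decode bytes as UTF-8, falling back to GBK.

    Hypothetical helper mirroring the patched logic; not PyPDF2's API.
    """
    for encoding in ("utf-8", "gbk"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Neither encoding worked: warn and keep the raw bytes, or fail in strict mode.
    if strict:
        raise ValueError("Illegal character in Name Object")
    warnings.warn("Illegal character in Name Object")
    return raw


print(decode_name("中文".encode("utf-8")))  # decoded on the first attempt
print(decode_name("中文".encode("gbk")))    # UTF-8 fails, GBK succeeds
```

UTF-8 is tried first because GBK accepts many byte sequences that are also valid UTF-8 and would decode them into the wrong characters.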
The steps above were tested successfully on both Windows and Linux on January 9, 2021.