PyPDF2 Chinese configuration

PyPDF2 encodes strings as Latin-1 by default, so it reports an error when processing documents that contain Chinese text.

This article applies to both Linux and Windows (tested).

Quick method: (overwrite file)

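Download the configuration files generic.py and utils.py and copy them into the PyPDF2 package directory, overwriting the existing files. On Windows the directory is, for example, C:\Users\currentuser\AppData\Local\Programs\Python\Python39\Lib\site-packages\PyPDF2
On Linux, use find to locate the directory.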
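
The PyPDF2 directory can also be located from Python itself, which works the same way on both platforms. A minimal sketch, assuming a hypothetical helper named package_dir (it is not part of PyPDF2):

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory of an installed package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin is the path to the package's __init__.py
    return os.path.dirname(spec.origin)


if __name__ == "__main__":
    # Prints the directory that contains generic.py and utils.py,
    # or None if PyPDF2 is not installed in this interpreter.
    print(package_dir("PyPDF2"))
```

Run it with the same interpreter that your project uses, since each Python installation or virtual environment has its own site-packages.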

Custom: (modify the configuration file yourself)

In utils.py, at around line 240:

r = s.encode('latin-1')
if len(s) < 2:
    bc[s] = r
return r

Change it to:

r = s.encode('utf-8')
if len(s) < 2:
    bc[s] = r
return r
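
The reason for this change can be seen with a single Chinese character: Latin-1 only covers code points up to U+00FF, so encoding fails, while UTF-8 can encode any Unicode text. A small standalone illustration, independent of PyPDF2 (try_encode is a hypothetical helper used only here):

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the text cannot be encoded."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    print(try_encode("中", "latin-1"))  # encoding fails, so None is returned
    print(try_encode("中", "utf-8"))    # three bytes for this character
```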

In generic.py, at around line 480:

try:
    return NameObject(name.decode('utf-8'))
except (UnicodeEncodeError, UnicodeDecodeError) as e:
    # Name objects should represent irregular characters
    # with a '#' followed by the symbol's hex number
    if not pdf.strict:
        warnings.warn("Illegal character in Name Object", utils.PdfReadWarning)
        return NameObject(name)
    else:
        raise utils.PdfReadError("Illegal character in Name Object")

Change it to:

try:
    return NameObject(name.decode('utf-8'))
except (UnicodeEncodeError, UnicodeDecodeError) as e:
    try:
        return NameObject(name.decode('gbk'))
    except (UnicodeEncodeError, UnicodeDecodeError) as e:
        # Name objects should represent irregular characters
        # with a '#' followed by the symbol's hex number
        if not pdf.strict:
            warnings.warn("Illegal character in Name Object", utils.PdfReadWarning)
            return NameObject(name)
        else:
            raise utils.PdfReadError("Illegal character in Name Object")
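
The fallback order in this change (UTF-8 first, then GBK) can be tried outside PyPDF2. The sketch below is only an illustration of the same idea; decode_name is a hypothetical helper, not a PyPDF2 function:

```python
def decode_name(raw):
    """Decode bytes as UTF-8 first, then as GBK; return None if both fail."""
    for encoding in ("utf-8", "gbk"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


if __name__ == "__main__":
    gbk_bytes = "中文".encode("gbk")     # not valid UTF-8, so GBK is used
    utf8_bytes = "中文".encode("utf-8")  # decoded on the first attempt
    print(decode_name(gbk_bytes), decode_name(utf8_bytes))
```

UTF-8 is tried first because GBK-encoded Chinese text is usually rejected by the UTF-8 decoder, whereas the GBK decoder may accept UTF-8 bytes and silently produce the wrong characters.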

The above was tested successfully on both Windows and Linux on January 9, 2021.

Origin blog.csdn.net/qq_41238308/article/details/108572771