UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1543: character maps to <undefined>

Background

While decoding the raw bytes of an HTML file (that is, converting bytes to str), I ran into the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1543: character maps to <undefined>

Here is the code that was executed:

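import chardet  # third-party package: pip install chardet

with open('1.html', 'rb') as r:
    content = r.read()  # raw bytes
    # chardet only guesses; 'encoding' is None when it cannot decide
    encoding = chardet.detect(content)['encoding']
    if encoding is None:
        encoding = 'utf-8'
    # raises UnicodeDecodeError when the guessed encoding is wrong
    content = content.decode(encoding)
    print(content)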

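Solution

Decoding with the detected encoding can still fail, because the detected encoding is only a guess and is not necessarily correct. When that happens, fall back to decoding with utf-8: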

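import chardet  # third-party package: pip install chardet

with open('1.html', 'rb') as r:
    content = r.read()
    encoding = chardet.detect(content)['encoding']
    if encoding is None:
        encoding = 'utf-8'
    try:
        content = content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # wrong guess, or an encoding name Python does not know: fall back to utf-8
        content = content.decode('utf-8')
    print(content)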
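
The same fallback idea can be written as a small reusable function. This is only a minimal sketch: decode_with_fallback is a hypothetical helper name (it is not part of chardet or the standard library), and the last-resort step with errors='replace' is my own assumption, added so that the function never raises even when the data is not valid utf-8 either.

```python
def decode_with_fallback(data, candidates=('utf-8',)):
    """Try each candidate encoding in order and never raise.

    Returns a (text, encoding_used) tuple.
    """
    for name in candidates:
        if not name:  # skip None, e.g. when detection returned no encoding
            continue
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Last resort (assumption): undecodable bytes become U+FFFD
    return data.decode('utf-8', errors='replace'), 'utf-8 (lossy)'


# U+0410 is b'\xd0\x90' in UTF-8, and byte 0x90 is undefined in cp1254,
# so the first candidate fails and the second one is used.
raw = '\u0410'.encode('utf-8')
text, used = decode_with_fallback(raw, candidates=('cp1254', 'utf-8'))
print(used)
```

In the script above, the candidates would be the encoding reported by chardet followed by 'utf-8'.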

In fact, in this example the output of chardet.detect(content) is

{'encoding': 'Windows-1254', 'confidence': 0.417065260641214, 'language': 'Turkish'}

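The confidence is very low (about 0.42), so the guess should not be trusted. Byte 0x90 has no mapping in Windows-1254, which is why the 'charmap' codec raised the error.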
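
Since the detection result carries a confidence value, one option is to ignore the guess when the confidence is too low. The following is a rough sketch under my own assumptions: choose_encoding is a hypothetical helper, and the 0.8 threshold is an arbitrary value picked for illustration, not something recommended by chardet.

```python
def choose_encoding(detection, min_confidence=0.8, default='utf-8'):
    """Return the detected encoding only when the detector is confident enough.

    `detection` is a dict shaped like the output shown above.
    The 0.8 threshold is an assumption; tune it for your data.
    """
    encoding = detection.get('encoding')
    confidence = detection.get('confidence') or 0.0
    if encoding is None or confidence < min_confidence:
        return default
    return encoding


# The detection result reported in this example
result = {'encoding': 'Windows-1254', 'confidence': 0.417065260641214, 'language': 'Turkish'}
print(choose_encoding(result))  # utf-8
```

With this check, the low-confidence Windows-1254 guess is discarded and utf-8 is used directly, without first triggering the exception.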

⚠️ The 'rb' mode and the encoding parameter of open() are incompatible. Specifying both raises: ValueError: binary mode doesn't take an encoding argument
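
To make the warning concrete, here is a small sketch that triggers the ValueError and then shows the two valid alternatives: text mode with encoding, or binary mode with an explicit decode. The temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location for the demonstration
with tempfile.NamedTemporaryFile('wb', suffix='.html', delete=False) as tmp:
    tmp.write('<p>\u0410</p>'.encode('utf-8'))
    path = tmp.name

try:
    open(path, 'rb', encoding='utf-8')  # binary mode plus encoding: not allowed
except ValueError as exc:
    message = str(exc)
    print(message)

# Option 1: text mode, open() does the decoding
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Option 2: binary mode, decode the bytes explicitly
with open(path, 'rb') as f:
    same_text = f.read().decode('utf-8')

os.remove(path)  # clean up the temporary file
```

Option 2 is what the code in this post uses, because chardet needs the raw bytes before any decoding happens.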

Reposted from blog.csdn.net/raelum/article/details/133364811