Successfully solved UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2192: illegal multibyte seque

Project scenario:

During a language-processing task, the contents of a .txt text file need to be read.

Problem description:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2192: illegal multibyte sequence

Cause analysis:

This error usually means that the bytes were decoded with the wrong encoding, so some byte sequences cannot be mapped to characters. In this specific error message, the 'gbk' codec tried to decode the file's bytes, found the byte 0xa6 at position 2192, which does not form a valid GBK sequence there, and raised a UnicodeDecodeError. This typically happens when open() is called without an encoding argument on a system whose locale default encoding is GBK (for example, Chinese-language Windows), while the file itself is stored in another encoding such as UTF-8.
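The mismatch can be reproduced in a few lines. The following is a minimal sketch, not the original project's code: the sample text and the temporary file are made up for illustration. It writes a UTF-8 file, reads it back as GBK to trigger the error, and prints the locale default that open() falls back to when no encoding is given.

```python
import locale
import os
import tempfile

# Hypothetical sample text; the euro sign takes three bytes in UTF-8.
sample = "price: 5€ only\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text.txt")

    # Save the text as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading the same bytes as GBK fails with UnicodeDecodeError.
    mismatch_error = None
    try:
        with open(path, "r", encoding="gbk") as f:
            f.read()
    except UnicodeDecodeError as exc:
        mismatch_error = exc

    # Reading with the encoding the file was written in succeeds.
    with open(path, "r", encoding="utf-8") as f:
        recovered = f.read()

print("locale default encoding:", locale.getpreferredencoding(False))
print("error when read as GBK:", mismatch_error)
```

If the first printed value is a GBK variant such as cp936, any open() call without an explicit encoding on that machine will decode files as GBK.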

Solution:

(1) Try changing the encoding used to read the file. Python's open() does not necessarily default to UTF-8 (it uses the locale's encoding), so pass encoding='utf-8' explicitly. For example:

filename = 'text.txt'

# Open the file with an explicit UTF-8 encoding
with open(filename, 'r', encoding='utf-8') as f:
    # Process the file contents
    content = f.read()
    print(content)

If the file is not actually UTF-8 encoded, try other likely encodings such as 'gb18030' or 'big5'. If you don't know the file's encoding, you can use the third-party chardet package to detect it and then open the file with the detected encoding. The detection is a statistical guess and can be wrong, especially for short files. For example:

import chardet

filename = 'text.txt'

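# Detect the file encoding from its raw bytes
with open(filename, 'rb') as f:
    result = chardet.detect(f.read())
    # chardet reports None when it cannot guess; fall back to UTF-8 in that case
    encoding = result['encoding'] or 'utf-8'

# Open the file with the detected encoding
with open(filename, 'r', encoding=encoding) as f:
    # Process the file contents
    content = f.read()
    print(content)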
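If installing chardet is not an option, trying a short list of candidate encodings in order is a common alternative. The sketch below is only an illustration: read_with_fallback is a hypothetical helper name, the candidate list is an assumption taken from the encodings mentioned above, and the test file is made up.

```python
import os
import tempfile

def read_with_fallback(path, encodings=("utf-8", "gb18030", "big5")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
    raise ValueError(f"none of {encodings} could decode {path!r}")

# Usage: a made-up file saved as GB18030.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text.txt")
    with open(path, "wb") as f:
        f.write("中文文本".encode("gb18030"))
    text, used = read_with_fallback(path)

print(used, text)
```

Put the strictest encoding first. UTF-8 rejects most text saved in other encodings, whereas a permissive codec may decode the wrong bytes without raising and produce garbled text, so a successful decode is not proof that the encoding is right.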

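(2) Specify both the encoding and an error handler when opening the file, for example with the codecs library (the built-in open() accepts the same errors argument). Note that errors='ignore' silently drops every byte that cannot be decoded, so some characters may be lost. For example: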

import codecs

filename = 'text.txt'

# Use the codecs library to open the file with an explicit encoding and error handler
with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as f:
    # Process the file contents
    content = f.read()
    print(content)
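To see what an error handler actually does to the data, it helps to compare two of them on the same bytes. This is a small sketch on made-up bytes, not on the original file: errors='ignore' drops the undecodable byte, while errors='replace' leaves a visible U+FFFD marker that can be counted afterwards.

```python
# Made-up bytes: ASCII text containing one byte (0xff) that is never valid in UTF-8.
raw = b"abc\xffdef"

ignored = raw.decode("utf-8", errors="ignore")    # bad byte silently dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
print("bytes that could not be decoded:", replaced.count("\ufffd"))
```

A large count usually suggests that the chosen encoding is wrong for the whole file, in which case changing the encoding is a better fix than hiding the errors.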

(3) Read the file in binary mode. With 'rb' (binary mode) instead of 'r' (text mode), Python returns the raw bytes without decoding them, so no UnicodeDecodeError is raised while reading. You can then inspect the bytes and decode them yourself with a suitable encoding.
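Once the raw bytes are available, the exception object itself tells you where decoding failed. The following sketch uses a hypothetical helper, find_bad_byte, and made-up data; with a real file the bytes would come from reading it in 'rb' mode.

```python
def find_bad_byte(raw, encoding="utf-8", context=10):
    """Return (position, surrounding bytes) of the first undecodable byte, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the offending bytes inside raw.
        start = max(exc.start - context, 0)
        return exc.start, raw[start:exc.end + context]
    return None

# With a real file: raw = open(filename, 'rb').read()
raw = b"hello \xff world"  # made-up data
print(find_bad_byte(raw))
```

Looking at the bytes around the reported position often makes it clear whether the file uses a different encoding or contains a few corrupted bytes.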

(4) If the dataset contains a few special characters or symbols, you may need to handle them manually, for example by removing or replacing them in the dataset.
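One way to do this after the text has been decoded is a translation table. The sketch below is an assumption about what such cleanup might look like: the REPLACEMENTS table and the clean_text name are hypothetical, and the table should be adjusted to the characters that actually occur in your data.

```python
# Hypothetical replacement table; adjust it to the characters in your own data.
REPLACEMENTS = {
    "\ufffd": "",   # replacement character left behind by errors='replace'
    "\u00a0": " ",  # non-breaking space
    "\u3000": " ",  # full-width (ideographic) space
}
TABLE = str.maketrans(REPLACEMENTS)

def clean_text(text):
    """Remove or replace the characters listed in REPLACEMENTS."""
    return text.translate(TABLE)

print(repr(clean_text("a\u00a0b\ufffdc\u3000d")))
```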

(5) Re-download or re-obtain the dataset, in case the file was corrupted, and make sure you open it with the correct encoding.