【Python-ERROR】'gbk' codec can't encode character '\xa0' or '\u2003' in position XXX

Python raises an encoding error when printing output

1. Error encountered

When crawling web pages with Python, I use the print function to output the content for debugging, but I keep running into characters that cannot be encoded. I tried various methods to no avail. The error messages are as follows:

UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 8186: illegal multibyte sequence
UnicodeEncodeError: 'gbk' codec can't encode character '\u2003' in position 7254: illegal multibyte sequence
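
The error can be reproduced without a crawler or a console. The sketch below is only an illustration: the helper name can_encode and the sample string are made up for this example. It checks whether a string can be encoded with a given codec, which is what print has to do when the console encoding is gbk.

```python
def can_encode(text, encoding):
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample containing both problem characters.
sample = "T-shirt\xa0短袖\u2003"

print(can_encode(sample, "gbk"))      # False: gbk has no mapping for \xa0 or \u2003
print(can_encode(sample, "gb18030"))  # True: gb18030 covers all of Unicode
print(can_encode("短袖", "gbk"))       # True: ordinary Chinese text is fine
```

So the Chinese text itself is not the problem; only the two special space characters are.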

2. Solutions

1. The extracted text contains "\xa0", and it is not obvious how to get rid of it. After some reading, it turns out that this character is a kind of space.

\xa0 is the non-breaking space character. The space we usually use is \x20, which falls within the standard ASCII printable range 0x20~0x7e. \xa0 belongs to the extended character set latin1 (ISO/IEC 8859-1), where it represents nbsp (non-breaking space). latin1 is backward compatible with ASCII, so the range 0x20~0x7e is the same in both.

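You can remove it with the translate method or with split(); both approaches also work for \t and \n. Take split() as an example (note that "".join(s.split()) deletes every whitespace character, including ordinary spaces):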

>>> s
'T-shirt\xa0\xa0短袖圆领衫,体恤衫\xa0'
>>> out = "".join(s.split())
>>> out
'T-shirt短袖圆领衫,体恤衫'
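
If the spacing between words should be kept, the translate method mentioned above can map the special spaces to an ordinary space instead of deleting all whitespace. This is a minimal sketch; the names SPACE_TABLE and normalize_spaces are made up for illustration.

```python
# Map non-breaking space (U+00A0) and em space (U+2003) to an ordinary space.
SPACE_TABLE = str.maketrans({"\xa0": " ", "\u2003": " "})


def normalize_spaces(text):
    """Replace the special space characters with U+0020 and leave the rest untouched."""
    return text.translate(SPACE_TABLE)


s = "T-shirt\xa0\xa0短袖圆领衫,体恤衫\xa0"
cleaned = normalize_spaces(s).strip()
print(repr(cleaned))  # 'T-shirt  短袖圆领衫,体恤衫'
```

The two non-breaking spaces after "T-shirt" become two ordinary spaces, and the trailing one is removed by strip().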

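2. The extracted text contains '\u2003' (an em space), which gbk cannot encode either. This can be solved by adding the following code at the beginning of the script. It re-wraps standard output with gb18030, an encoding that extends gbk and covers all of Unicode: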

import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')
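
A variation on the same idea, in case the console must stay on gbk: io.TextIOWrapper also accepts an errors argument, and errors="replace" writes a "?" for any character the encoding cannot represent instead of raising. The sketch below is not from the original article; the function name make_tolerant_writer is hypothetical, and an in-memory buffer stands in for the console so the result can be inspected.

```python
import io


def make_tolerant_writer(binary_stream, encoding="gbk"):
    """Wrap a binary stream so that unencodable characters become '?'."""
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors="replace")


# An in-memory buffer stands in for the console in this demonstration.
buf = io.BytesIO()
out = make_tolerant_writer(buf)
out.write("a\xa0b\u2003c")
out.flush()
print(buf.getvalue())  # b'a?b?c'

# For real use, the same wrapper could be applied to standard output:
#   import sys
#   sys.stdout = make_tolerant_writer(sys.stdout.buffer)
```

The trade-off is that the replaced characters are lost in the output, which is usually acceptable for debugging prints.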

3. Of course, you can also comment out the print call, but that makes debugging inconvenient.

3. Reference articles

https://blog.csdn.net/wangbowj123/article/details/78061618
https://blog.csdn.net/qq_39241986/article/details/87896088
https://blog.csdn.net/a_xixi/article/details/88030830

Origin blog.csdn.net/Artificial_idiots/article/details/121474878