Python encoding error when printing output
1. Error encountered
When crawling web pages with Python, I use the print function to output the content for debugging, but I keep running into characters that cannot be encoded (a UnicodeEncodeError). I tried various methods to no avail. The error messages are as follows:
'gbk' codec can't encode character '\xa0' in position 8186: illegal multibyte sequence
'gbk' codec can't encode character '\u2003' in position 7254: illegal multibyte sequence
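The failure can be reproduced without a console by encoding the two characters to GBK directly. This is a minimal sketch and rests on an assumption: that the console encoding is gbk (as is common on Chinese-locale Windows), so print performs the same encoding step internally. The names samples and failures are illustrative only.

```python
# Sketch: encode the problem characters with GBK, as print() would do
# on a console whose encoding is gbk (an assumption about the setup).
samples = {"no-break space": "\xa0", "em space": "\u2003"}

failures = {}
for name, ch in samples.items():
    try:
        ch.encode("gbk")
    except UnicodeEncodeError as exc:
        # exc.reason holds the short explanation from the codec
        failures[name] = exc.reason
        print(f"{name} ({ch!r}): {exc}")
```

Both characters end up in failures, which suggests the problem lies in the output encoding rather than in the crawled data itself.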
2. Solutions
1. The extracted text contains "\xa0", which is not removed the way an ordinary space would be. After consulting the relevant material, I found that this character is itself a kind of space.
\xa0 is the non-breaking space character. The space we usually use is \x20, which lies in the standard ASCII printable range 0x20~0x7e. \xa0, on the other hand, belongs to the extended character set latin1 (ISO/IEC 8859-1), where it represents the blank character nbsp (non-breaking space). The latin1 character set is backward compatible with ASCII (0x20~0x7e). GBK has no mapping for \xa0, which is why encoding fails.
You can remove it with the translate method or with split(), both of which can also handle \t and \n characters. Take split() as an example (note that joining with an empty string removes all whitespace, including ordinary spaces between words):
>>> s
'T-shirt\xa0\xa0短袖圆领衫,体恤衫\xa0'
>>> out = "".join(s.split())
>>> out
'T-shirt短袖圆领衫,体恤衫'
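The translate approach mentioned above can be sketched as follows. Unlike the split() version, it replaces the special spaces with ordinary ones instead of deleting all whitespace, so words stay separated. The helper normalize_spaces and the table SPACE_MAP are hypothetical names, and the characters in the table are an assumption; extend it to match what your pages actually contain.

```python
# Map each troublesome space character to an ordinary ASCII space.
# Which characters to include is an assumption for this sketch.
SPACE_MAP = str.maketrans({"\xa0": " ", "\u2003": " "})


def normalize_spaces(text: str) -> str:
    """Replace no-break and em spaces with ' ', then trim both ends."""
    return text.translate(SPACE_MAP).strip()


s = "T-shirt\xa0\xa0短袖圆领衫,体恤衫\xa0"
out = normalize_spaces(s)
# out == 'T-shirt  短袖圆领衫,体恤衫' (the two inner spaces are kept)
```

After this step the string contains only characters that GBK can encode, so printing it to a gbk console should no longer raise the error.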
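2. The extracted text may also contain '\u2003' (an em space). This can be solved by adding the following code at the top of the script. It re-wraps standard output with the gb18030 encoding, a superset of GBK that can encode every Unicode character:
import io
import sys
# Re-wrap stdout so that print() encodes with gb18030 instead of gbk
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')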
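To see why the gb18030 wrapper helps, the same TextIOWrapper can be pointed at an in-memory buffer instead of the real console. This is only an illustrative sketch: buffer, writer, and text are made-up names, and io.BytesIO stands in for sys.stdout.buffer.

```python
import io

# Text containing both problem characters from the error messages.
text = "a\xa0b\u2003c"

# io.BytesIO stands in for sys.stdout.buffer in this sketch.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="gb18030")

writer.write(text)   # encodes with gb18030; no UnicodeEncodeError
writer.flush()       # push the encoded bytes into the buffer

raw = buffer.getvalue()
decoded = raw.decode("gb18030")  # round-trips back to the original text
```

The write succeeds and the bytes decode back to the original string, whereas the same text cannot be encoded with gbk at all.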
3. Of course, you can also simply comment out the print calls, but that makes debugging inconvenient.
3. Reference articles
https://blog.csdn.net/wangbowj123/article/details/78061618
https://blog.csdn.net/qq_39241986/article/details/87896088
https://blog.csdn.net/a_xixi/article/details/88030830