A piece of text I recently scraped from a web page looks like this when printed (it contains newlines and spaces):
Bitcoin
比特币
When I used Python's string replace method to strip the spaces and \n characters and then printed the result, I got this:
比特币
What? All I did was this:
content = content.replace(' ', '')
content = content.replace('\n', '')
print content
So why did the Bitcoin text disappear? I never replaced it.
What went wrong?
The answer was not what I expected.
Wrapping the string in square brackets turns it into a list, and printing a list shows the repr of each element, so hidden characters become visible as escape sequences:
content = tag.h1.get_text()
print [content], "--------------\n", content
content = content.replace(' ', '')
print [content], "--------------\n", content
content = content.replace('\n', '')
print [content], "--------------\n", content
It prints out:
[u'\n\r\nBitcoin\r\n \r\n\u6bd4\u7279\u5e01 '] --------------
Bitcoin
比特币
[u'\n\r\nBitcoin\r\n\r\n\u6bd4\u7279\u5e01'] --------------
Bitcoin
比特币
[u'\rBitcoin\r\r\u6bd4\u7279\u5e01'] --------------
比特币
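The snippets in this post use the Python 2 print statement. As a rough Python 3 sketch of the same inspection, the built-in ascii() can stand in for the square-bracket trick: it returns the escaped form of a string, so control characters are spelled out instead of being acted on by the terminal. The sample string below is hard-coded to match the output above and only stands in for tag.h1.get_text().

```python
# Sample text standing in for tag.h1.get_text(); hard-coded for illustration.
content = '\n\r\nBitcoin\r\n \r\n\u6bd4\u7279\u5e01 '

# ascii() returns the escaped representation, so control characters
# (and non-ASCII characters) show up as visible escape sequences.
escaped = ascii(content)
print(escaped)

# Count the whitespace characters that a plain print would not reveal.
hidden = {
    'carriage returns': content.count('\r'),
    'newlines': content.count('\n'),
    'spaces': content.count(' '),
}
print(hidden)
```

Counting the characters this way makes it obvious that more than spaces and \n are hiding in the text.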
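The text actually contains \r (carriage return) characters, most likely because the page uses Windows-style \r\n line endings. Once the \n characters are gone, each remaining \r sends the terminal cursor back to the start of the line, so whatever is printed next overwrites Bitcoin. How do we fix it? Replace \r as well: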
content = content.replace('\r', '')
print [content], "--------------\n", content
# Printed result
[u'Bitcoin\u6bd4\u7279\u5e01'] --------------
Bitcoin比特币
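Chaining one replace call per character works, but it is easy to miss a character, as happened here with \r. One possible alternative, sketched below for Python 3, is to split on any whitespace and join the pieces back together. The helper name strip_all_whitespace is hypothetical, and the sample string is again hard-coded rather than taken from a real page.

```python
def strip_all_whitespace(text):
    """Remove every whitespace character (spaces, tabs, CR, LF)."""
    # str.split() with no arguments splits on runs of any whitespace,
    # so joining the pieces back together drops all of it in one pass.
    return ''.join(text.split())


# Sample text standing in for tag.h1.get_text(); hard-coded for illustration.
content = '\n\r\nBitcoin\r\n \r\n\u6bd4\u7279\u5e01 '

cleaned = strip_all_whitespace(content)

# ascii() shows escapes, so no control character can hide in the output.
print(ascii(cleaned))
```

If the two words should stay separated, joining with ' ' instead of '' keeps a single space between them.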