Why Python's string replace seemed to make text disappear

A piece of text I recently scraped from a web page looks like this (including the newlines and spaces):



Bitcoin
                            
比特币                        

When I used Python's string replace method to remove the spaces and \n characters and then printed the result, I got this:

比特币

WTF? All I did was this:

content = content.replace(' ', '')
content = content.replace('\n', '')
print content

So why did "Bitcoin" disappear? I never replaced it.

What went wrong?

 

The answer was not what I expected.

Wrapping the string in square brackets puts it in a list, and printing a list shows each element in its escaped form, so the invisible characters become visible:

content = tag.h1.get_text()
print [content], "--------------\n", content
content = content.replace(' ', '')
print [content], "--------------\n", content
content = content.replace('\n', '')
print [content], "--------------\n", content

It prints out:

[u'\n\r\nBitcoin\r\n                            \r\n\u6bd4\u7279\u5e01                        '] --------------


Bitcoin
                            
比特币                        
[u'\n\r\nBitcoin\r\n\r\n\u6bd4\u7279\u5e01'] --------------


Bitcoin

比特币
[u'\rBitcoin\r\r\u6bd4\u7279\u5e01'] --------------
比特币
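
The list trick works because printing a list shows the repr of each element. Below is a minimal sketch of the same check that calls repr() directly. It is written in Python 3 syntax (unlike the Python 2 snippets in this post), and the sample string is copied from the last debug line above rather than fetched from the page.

```python
# Sample string copied from the final debug output above (Python 3 syntax).
content = '\rBitcoin\r\r\u6bd4\u7279\u5e01'

# repr() shows control characters as escape sequences instead of
# letting the terminal interpret them.
print(repr(content))

# The word is still in the string; only the on-screen display hides it.
print('Bitcoin' in content)  # True
print(len(content))          # 13
```

The membership test and the length confirm that nothing was removed from the string.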

 

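The string actually contains \r (carriage return) characters, most likely because the page uses Windows-style \r\n line endings. Once the \n characters are removed, each remaining \r sends the terminal cursor back to the start of the line, so the text printed afterwards overwrites "Bitcoin" on screen. Nothing was deleted from the string itself. How do we fix the output? Replace \r as well: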

content = content.replace('\r', '')
print [content], "--------------\n", content

# Printed result
[u'Bitcoin\u6bd4\u7279\u5e01'] --------------
Bitcoin比特币
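
Chaining replace calls works, but it is easy to miss a character, as happened here with \r. One possible alternative is to remove every whitespace character in a single pass. The sketch below uses Python 3 syntax; strip_whitespace is a hypothetical helper name, not part of the original code, and the sample string is copied from the first debug line above.

```python
def strip_whitespace(text):
    """Remove every whitespace character from text."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, \n, \r), so joining the pieces drops all of it.
    return ''.join(text.split())


# Raw value copied from the first debug line in this post.
raw = ('\n\r\nBitcoin\r\n                            '
       '\r\n\u6bd4\u7279\u5e01                        ')

cleaned = strip_whitespace(raw)
print(cleaned)  # Bitcoin比特币
```

Note that this also removes any whitespace inside the text itself, which matches what the replace calls above were doing.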

 

 
