Understanding garbled text: from "�" to "锟斤拷" (kun jin kao)

Let's start with a five-character quatrain. Do you know the story behind it?

Holding two knives

Mouth screaming hot

Stepping on thousands of flowers

Look at everything with a smile

What is "�"?

Not long ago, Brother Shitou's article "You may also fall into the pit of this simple String" described a series of pitfalls caused by character encoding problems. A magical character, "�", appears in that article.

In fact, "�" really is everywhere, for example in the well-known WeChat:

(Figure: "�" in WeChat)

As another example, in the cover picture the item with a unit price of 22 yuan is labeled "锟斤拷锟斤拷" (kun jin kao kun jin kao). A quick Baidu search turns up plenty more:

(Figure: "锟斤拷" is ubiquitous)

To explain this problem, we have to start with encoding.

To a computer, everything is binary. An encoding is simply the rule for which binary number represents which symbol. Don't think of encoding as something complicated; it is really just a simple mapping.

For example, the well-known ASCII code specifies that the binary value 0100 0001, which is 65 in decimal, represents the capital letter A.

(Figure: ASCII encoding)
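The mapping between the number 65 and the letter A can be checked directly. The following is a minimal Java sketch; the class name `AsciiMappingDemo` and the helper `toBinary8` are made up for illustration and do not come from the original article.

```java
class AsciiMappingDemo {
    // Hypothetical helper: formats a code value as an 8-bit binary string.
    static String toBinary8(int code) {
        String bits = Integer.toBinaryString(code);
        // Left-pad with zeros so that 65 is shown as 01000001.
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        char letter = (char) 65; // number -> symbol
        int code = 'A';          // symbol -> number

        System.out.println(letter);          // A
        System.out.println(code);            // 65
        System.out.println(toBinary8(code)); // 01000001
    }
}
```

The cast in both directions is the whole "encoding" here: the same bit pattern is read either as a number or as a symbol.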

"�" is also an encoded character, just like the A above. It is a special character in Unicode, U+FFFD (65533 in decimal), known as the replacement character. Semantically it is a placeholder: it stands for something unknown, something the encoding system does not recognize.

For example, in the screenshot of the experiment from the previous article, the bytes circled in red are not recognized by UTF-8. Following the Unicode definition, they have to be represented by the uniform placeholder 0xFFFD (65533).

Why does "锟斤拷" appear?

Let's continue with the previous example. The figure below again shows the part corresponding to the binary encoding of "Programmer Stone":

As shown in the figure above, UTF-8 simply does not recognize the byte array new byte[] {-25, -119, -25, -116} on line 18, so the bytes can only be replaced with placeholders.

��
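The replacement can be reproduced with a few lines of Java. This is only a sketch of the experiment described here, not the code from the original screenshot; the class name `ReplacementCharDemo` and the method `decodeAsUtf8` are hypothetical. It relies on the documented behavior of `new String(byte[], Charset)`, which substitutes the charset's replacement string for malformed input.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Hypothetical helper: decodes raw bytes as UTF-8.
    // Malformed sequences are silently replaced, not reported as errors.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE7 0x89 0xE7 0x8C: two truncated three-byte sequences.
        byte[] bytes = {-25, -119, -25, -116};
        String text = decodeAsUtf8(bytes);

        // Print each decoded character as a code point; every one
        // of them is the placeholder U+FFFD (65533).
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X (%d)%n", (int) c, (int) c);
        }
    }
}
```

Because the decoder replaces bad input without raising an error, the original bytes are already lost by the time the placeholder shows up on screen.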

This kind of situation is fairly common during encoding conversion. If the two sides have not agreed on an encoding, it is easy for one side to fail to recognize the other's bytes.

On Chinese-language systems, the common character encoding is GBK. Since nothing was agreed on in advance, the receiving side decodes the bytes as GBK by default.

And here comes "锟斤拷".

Surprised? Caught off guard?

The reason is that "�", once encoded in UTF-8, becomes 0xEFBFBD (the byte array [-17, -65, -67]). Two of them in a row are 0xEFBFBDEFBFBD, which is the byte array [-17, -65, -67, -17, -65, -67].
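The byte values above can be verified by encoding U+FFFD as UTF-8. A minimal sketch follows; the class name `ReplacementBytesDemo` and the method `utf8Bytes` are illustrative names only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementBytesDemo {
    // Hypothetical helper: encodes text as UTF-8 bytes.
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One replacement character: 0xEF 0xBF 0xBD.
        System.out.println(Arrays.toString(utf8Bytes("\uFFFD")));
        // [-17, -65, -67]

        // Two replacement characters in a row: 0xEFBFBD 0xEFBFBD.
        System.out.println(Arrays.toString(utf8Bytes("\uFFFD\uFFFD")));
        // [-17, -65, -67, -17, -65, -67]
    }
}
```

Java bytes are signed, which is why 0xEF, 0xBF, and 0xBD print as -17, -65, and -67.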

GBK uses a double-byte encoding scheme for Chinese characters, so the 6 bytes 0xEFBFBDEFBFBD are split into three 2-byte characters: 锟 (0xEFBF), 斤 (0xBDEF), and 拷 (0xBFBD).
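The whole chain can be sketched in Java: take two replacement characters, encode them as UTF-8, then decode the same bytes as GBK. The class name `KunJinKaoDemo` and the method `utf8BytesReadAsGbk` are hypothetical, and the sketch assumes the runtime ships the GBK charset (a standard full JDK does; a stripped-down runtime may not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class KunJinKaoDemo {
    // Hypothetical helper: writes text as UTF-8, then reads the
    // same bytes back as if they were GBK (the mismatch described above).
    static String utf8BytesReadAsGbk(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: the GBK charset is available in this runtime.
        return new String(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String garbled = utf8BytesReadAsGbk("\uFFFD\uFFFD");

        // Print code points as well, since the console encoding
        // may itself garble the Chinese characters.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X %c%n", (int) c, c);
        }
        // Code points printed: U+951F, U+65A4, U+62F7
    }
}
```

The characters are written as `\u` escapes in the source so that the example does not depend on the encoding of the source file itself.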

Origin blog.csdn.net/vcit102/article/details/131736949