Rambling: how do you explain "锟斤拷" ("Kun pounds copy") to your girlfriend?

One weekend my girlfriend went out shopping while I stayed home alone watching a variety show. Suddenly, she called me.

After a while she came back, took out her phone, and showed me a photo she had taken in the supermarket:


To understand what garbled text is, we need to start with character encoding in computers.

Character encoding and ASCII

In spy dramas we often see enemy agents, underground Party members, and the Eighth Route Army sending intelligence, usually by telegraph. To transmit, the sending operator taps out signals of different lengths on the telegraph key, and the receiving operator hears them as a ticking sound. The sound is a combination of a short "di" and a long "dah", where a "dah" lasts three times as long as a "di".

The sender first uses some scheme to turn the message into telegraph ticks. The receiver hears the ticks and translates them back into normal text. This process is character encoding and decoding.

In spy dramas, messages are turned into the telegraph's "di" and "dah" mainly by means of Morse code, a character encoding that represents letters, digits, and punctuation as different sequences of signals. Morse code consists of short and long electrical pulses, called dots and dashes. Their durations are fixed: the dot is the basic unit, and a dash lasts as long as three dots. These correspond exactly to the "di" and "dah" of the telegraph.
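As a rough illustration of encoding text into dots and dashes, here is a minimal sketch. The class name `MorseSketch` and its four-letter table are assumptions made for this example; a real table would cover the whole alphabet.

```java
import java.util.Map;

// Illustrative only: a tiny subset of International Morse code.
class MorseSketch {
    // Hypothetical lookup table covering just a few letters.
    private static final Map<Character, String> TABLE = Map.of(
            'S', "...",
            'O', "---",
            'E', ".",
            'T', "-");

    // Encode each letter as dots and dashes, separating letters with a space.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toUpperCase().toCharArray()) {
            String code = TABLE.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No code in this sketch for: " + c);
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(code);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("SOS")); // ... --- ...
    }
}
```

The receiver needs the same table to reverse the mapping, which is the point the rest of this article builds on.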

Just as the telegraph can only emit "di" and "dah", a computer only understands two symbols, 0 and 1. Human writing, however, is highly varied, so converting human text into the 0s and 1s a computer understands also requires a character encoding.

A character encoding is a set of rules that pairs the characters of a natural language (such as an alphabet or a syllabary) with the elements of another set (such as numbers or electrical pulses).

Much as Morse code does for the telegraph, a character encoding developed in the United States in the 1960s defines a uniform mapping between English characters and binary bits. It is called ASCII and is still in use today.

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet. It is mainly used to display modern English and defines 128 characters in total: all uppercase and lowercase letters, the digits 0-9, punctuation marks, and the special control characters used in American English.

Because ASCII has only 128 characters, it can represent English but has no way to represent the many other scripts in the world, so a more comprehensive character encoding is needed.
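The sketch below shows both sides of this: an English letter maps to a 7-bit number, while a Chinese character has no ASCII code at all. The class and method names are made up for the example. `String.getBytes(Charset)` is documented to substitute the charset's replacement byte for unmappable characters, which for US-ASCII is `?`.

```java
import java.nio.charset.StandardCharsets;

class AsciiSketch {
    // Return the ASCII code of a character as a 7-bit binary string.
    static String toBits(char c) {
        if (c > 127) {
            throw new IllegalArgumentException("Not an ASCII character: " + c);
        }
        String bits = Integer.toBinaryString(c);
        return "0".repeat(7 - bits.length()) + bits;
    }

    // Encode text as ASCII bytes; unmappable characters become '?'.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');   // 65
        System.out.println(toBits('A')); // 1000001

        // \u6211 is the Chinese character for "I"; ASCII cannot represent it.
        byte[] lossy = encode("\u6211");
        System.out.println((char) lossy[0]); // ?
    }
}
```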

Before introducing other character encodings, let's first look at a universal character set used across computing.

Unicode

Unicode (known in Chinese as 万国码, 国际码, 统一码, or 单一码) is an industry standard in computer science. It collects and encodes most of the world's writing systems so that computers can display and process text in a simpler way.

Unicode is still being updated, and each new release adds more characters. At the time of writing, the latest version is 12.1, announced in May 2019. It adds only one character: the ligature for the new Japanese era name Reiwa (令和).

Unicode is widely recognized and widely used in the internationalization and localization of computer software. Many newer technologies, such as Extensible Markup Language (XML), the Java programming language, and modern operating systems, have adopted Unicode.

Unicode is a universal character set that contains most of the world's scripts. In other words, Unicode can represent Chinese.

UTF-8, UTF-16, UTF-32

Although Unicode assigns a unified code to every character in the world, it does not specify how those codes are stored. Consider the following:

Suppose Unicode required every symbol to be represented with three or four bytes, since there are so many characters that only that many bytes can cover them all.

Under such a rule, every English letter would carry two or three leading bytes of 0, because all English letters fit in a single ASCII byte and the remaining bytes would have to be padded with 0.

Text files would then be two to three times larger, which is a huge waste of storage. To solve this problem, several intermediate formats of the character set appeared. They are called Unicode Transformation Formats (UTF). Common UTF formats include UTF-7, UTF-7.5, UTF-8, UTF-16, and UTF-32.
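To make the zero padding concrete, here is a small sketch of the hypothetical fixed four-byte scheme described above. `FixedWidthSketch`, `fourBytes`, and `hex` are names invented for this example.

```java
import java.nio.ByteBuffer;

class FixedWidthSketch {
    // Store one code point in a fixed four-byte, big-endian slot.
    static byte[] fourBytes(int codePoint) {
        return ByteBuffer.allocate(4).putInt(codePoint).array();
    }

    // Format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' needs only one byte in ASCII, so three of the four bytes are zero.
        System.out.println(hex(fourBytes('A'))); // 00 00 00 41
    }
}
```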

UTF-8 uses one to four bytes to encode each character.

UTF-16 uses two or four bytes to encode each character.

UTF-32 uses four bytes to encode each character.

So we can say that UTF-8, UTF-16, and the others are all implementations of Unicode.

For example, Unicode assigns the Chinese character "我" (I) the code point "\u6211". However, the binary data actually stored differs between implementations such as UTF-8 and UTF-16.
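A short sketch can show this difference for the character above. The helper `hex` and the class name are assumptions for the example, and big-endian UTF-16 is used so that no byte order mark is added.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointBytesSketch {
    // Encode text with the given charset and format the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String wo = "\u6211"; // the character 我
        System.out.println(Integer.toHexString(wo.codePointAt(0))); // 6211
        System.out.println(hex(wo, StandardCharsets.UTF_8));        // E6 88 91
        System.out.println(hex(wo, StandardCharsets.UTF_16BE));     // 62 11
    }
}
```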

UTF-8 stores Unicode characters in a variable number of bytes. For example, ASCII characters still take 1 byte; accented letters and the Greek and Cyrillic alphabets take 2 bytes; commonly used Chinese characters take 3 bytes; and characters in the supplementary planes take 4 bytes.
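These byte counts can be checked with a few sample characters. This is only a sketch: the class name and the choice of samples are mine, and UTF-32 is left out because it is always four bytes per character.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Big-endian UTF-16 so that no byte order mark is counted.
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // ASCII letter
            "\u00E9",       // e with acute accent
            "\u6211",       // Chinese character 我
            "\uD83D\uDE00"  // U+1F600, an emoji in a supplementary plane
        };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + " UTF-8: " + utf8Length(s) + " bytes, UTF-16: " + utf16Length(s) + " bytes");
        }
    }
}
```

For these four samples the UTF-8 lengths are 1, 2, 3, and 4 bytes, and the UTF-16 lengths are 2, 2, 2, and 4 bytes.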

GBK, GB2312, GB18030

Because UTF-8 is an implementation of Unicode, it covers the text of every script in the world, using 1 to 4 bytes per character.

Characters that were encoded early may be stored in 1 or 2 bytes, while characters that were encoded later need 3 or 4 bytes.

Precisely because UTF-8 is so comprehensive, characters that were encoded late can take more bytes, so storing them requires more space.

Commonly used Chinese characters take 3 bytes each in UTF-8. But an encoding that only has to cover Chinese and ASCII does not need 3 bytes; 2 bytes are enough.
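As a quick check of the size difference, the sketch below encodes the same four Chinese characters with UTF-8 and with GBK. It assumes the Java runtime ships the `GBK` charset, which full JDK distributions normally do; the class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizeSketch {
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Assumes the GBK charset is available in this runtime.
    static int gbkLength(String text) {
        return text.getBytes(Charset.forName("GBK")).length;
    }

    public static void main(String[] args) {
        String text = "\u6F2B\u8BDD\u7F16\u7A0B"; // 漫话编程
        System.out.println(utf8Length(text)); // 12
        System.out.println(gbkLength(text));  // 8
    }
}
```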

Most websites serve only one country or region. A Chinese website, for example, usually contains simplified and traditional Chinese characters plus some English, and rarely any Japanese or Korean.

With this in mind, China's national standards administration published the GB 2312-80 encoding, which took effect in 1981. It is the national standard Simplified Chinese character set of the People's Republic of China. Later, vendors such as Microsoft used the code space left unused by GB 2312-80 to include all the characters of GB 13000.1-93, producing the GBK encoding.

With a standard Chinese character set available, a purely Chinese website can use such an encoding and save a good deal of storage space.

The commonly used Chinese encodings are GBK, GB2312, and GB18030; GBK is the most widely used.

  • GB2312 (1980): a 16-bit character set containing 6,763 simplified Chinese characters and 682 symbols, 7,445 characters in total;

    • Advantages: suited to Simplified Chinese environments; it is a Chinese national standard and is used in mainland China and Singapore;

    • Disadvantages: not compatible with Traditional Chinese; its character collection is too small.

  • GBK (1995): a 16-bit character set containing 21,003 Chinese characters and 883 symbols, 21,886 characters in total;

    • Advantages: suited to environments where Simplified and Traditional Chinese coexist; used by Simplified Chinese Windows; fully backward compatible with GB2312 and upward compatible with the ISO 10646 international standard; every character maps one-to-one to Unicode 2.0;

    • Disadvantages: not an official national standard; conversion to and from Big5 is required; many search engines do not support GBK Chinese characters well.

  • GB18030 (2000): a character set that uses up to 32 bits per character (1, 2, or 4 bytes); it contains 27,484 Chinese characters and also covers the scripts of major minority languages such as Tibetan, Mongolian, and Uyghur.

    • Advantages: covers practically every character and symbol you can think of; it is the latest Chinese national standard;

    • Disadvantages: relatively little software supports it.

Garbled text

Let's return to the telegraph example from earlier and assume the following scenario:

The sender converts the message into telegraph signals using American Morse code, but the receiver decodes the telegram using modern International Morse code. The resulting message may be completely unintelligible. This is garbled text.

The same thing happens in computing. Suppose we encode a string of Chinese characters as UTF-8 and send it to someone who then decodes it as GBK. The result may be an unreadable string full of characters such as 锟 (Kun), 斤 (pounds), and 拷 (copy). This is garbled text.

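Consider the following code:

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// GarbledDemo is an illustrative class name
public class GarbledDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "漫话编程!";

        // Encode the string to bytes using GBK
        byte[] bytes = s.getBytes(Charset.forName("GBK"));

        // Decode the same bytes with three different charsets
        System.out.println("GBK encode, GBK decode: " + new String(bytes, "GBK"));
        System.out.println("GBK encode, GB18030 decode: " + new String(bytes, "GB18030"));
        System.out.println("GBK encode, UTF-8 decode: " + new String(bytes, "UTF-8"));
    }
}

Output:

GBK encode, GBK decode: 漫话编程!
GBK encode, GB18030 decode: 漫话编程!
GBK encode, UTF-8 decode: ????????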

As you can see, Chinese characters that are encoded with GBK and then decoded as UTF-8 come out as a string of question marks. This is garbled text.
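`new String(bytes, charset)` silently substitutes a placeholder for bytes it cannot decode, which is why the mismatch above goes unnoticed until the text is displayed. One possible way to surface the problem earlier is a strict decoder that reports malformed input instead of replacing it. This is a minimal sketch with an invented class name, and it again assumes the `GBK` charset is available.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeSketch {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u6F2B\u8BDD\u7F16\u7A0B"; // 漫话编程
        byte[] gbkBytes = text.getBytes(Charset.forName("GBK"));
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(gbkBytes));  // false
        System.out.println(isValidUtf8(utf8Bytes)); // true
    }
}
```

A check like this only tells you that the bytes are not UTF-8. It cannot tell you which encoding they really use, so the reliable fix is still for both sides to agree on one encoding.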

The past and present of 锟斤拷

Because Unicode keeps being updated, there are always some relatively new characters that it cannot yet represent. And even after a new Unicode version has incorporated a character, much software has not been upgraded, so the same problem arises.

For example, some new emoji display properly on the phones of the manufacturers that introduced them, but may not display when sent to phones of other brands. The cause is that the character set in use does not support them.

When this happens, something still has to stand in for the character that cannot be displayed. Unicode defines a special character for this purpose: U+FFFD REPLACEMENT CHARACTER. Every character that cannot be represented is shown as this character.

The official Unicode chart for this symbol shows that its decimal value is 65533 and that its UTF-8 form in hexadecimal is 0xEF 0xBF 0xBD (three bytes).
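Both numbers are easy to confirm in code. The class and helper below are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharSketch {
    // Encode text as UTF-8 and format the bytes as hex.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String replacement = "\uFFFD"; // REPLACEMENT CHARACTER
        System.out.println(replacement.codePointAt(0)); // 65533
        System.out.println(utf8Hex(replacement));       // 0xEF 0xBF 0xBD
    }
}
```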

If two consecutive characters cannot be displayed, such as "��", their UTF-8 encoding in hexadecimal is:

0xEF 0xBF 0xBD
0xEF 0xBF 0xBD

If these bytes are then decoded as GBK, where a Chinese character takes two bytes, they are grouped as:

0xEF 0xBF, 0xBD 0xEF, 0xBF 0xBD

that is:

0xEFBF
0xBDEF
0xBFBD

Displayed, these are 锟 (0xEFBF), 斤 (0xBDEF), and 拷 (0xBFBD). So the next time you see 锟斤拷, your first thought should be that a conversion between UTF-8 and GBK went wrong.
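The regrouping described above can be reproduced directly: encode two replacement characters as UTF-8, then decode the six bytes as GBK. This is a minimal sketch with an invented class name, assuming the `GBK` charset is available in the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class KunJinKaoSketch {
    // Decode raw bytes as GBK (assumes the runtime provides this charset).
    static String decodeAsGbk(byte[] bytes) {
        return new String(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        // Two U+FFFD characters become six bytes: EF BF BD EF BF BD.
        byte[] utf8 = "\uFFFD\uFFFD".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 6

        // GBK reads the six bytes as three two-byte characters.
        String garbled = decodeAsGbk(utf8);
        System.out.println(garbled.length()); // 3

        // \u951F\u65A4\u62F7 are the three characters 锟, 斤, 拷.
        System.out.println(garbled.equals("\u951F\u65A4\u62F7")); // true
    }
}
```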

Besides 锟斤拷, there are two more classic garbled strings: "烫烫烫" and "屯屯屯". Both come from VC (Visual C++), which initializes memory in debug mode: newly allocated stack memory is filled with 0xCC, and newly allocated heap memory is filled with 0xCD. When memory filled with 0xCC or 0xCD is printed as characters, the byte pair 0xCCCC appears as 烫 and 0xCDCD appears as 屯.
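The same effect can be imitated outside the debugger by filling a buffer with one of these byte values and decoding it as GBK. The sketch below only simulates the fill pattern in Java and does not touch real uninitialized memory; the names are made up, and the `GBK` charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DebugFillSketch {
    // Fill a buffer with a single byte value and decode it as GBK.
    static String fillAndDecode(int fillByte, int length) {
        byte[] buffer = new byte[length];
        Arrays.fill(buffer, (byte) fillByte);
        return new String(buffer, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        // \u70EB is 烫 and \u5C6F is 屯.
        System.out.println(fillAndDecode(0xCC, 4).equals("\u70EB\u70EB")); // true
        System.out.println(fillAndDecode(0xCD, 4).equals("\u5C6F\u5C6F")); // true
    }
}
```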




Origin juejin.im/post/5d6498f1e51d456210163bc8