Why does text come out garbled? What are encoding and decoding? Why are there so many character sets?

Heartfelt and packed with substance: search WeChat for [ San Taizi Ao Bing ] to follow a programmer who is a little different.

This article is included on GitHub at https://github.com/JavaFamily , along with a complete set of interview topics for major companies, study materials, and my article series.

Preface

I believe everyone has run into garbled text. Today my girlfriend Sanwai hurried over to me: "My dear, why does my IDEA print garbled text?"

I fixed it for her in a flurry of keystrokes, but Sanwai, worthy of being my girlfriend from Mogujie, is just as curious as I am and kept pressing me with questions.

Then why are there garbled characters?

What is encoding and what is decoding?

What is a character code and what is a character set?

Why should there be Unicode? What is the difference between UTF-8 and GBK?

Sanwai sat on my lap and coquettishly fired off this string of questions. I can never say no to my girlfriend, so here is this article.

Why text gets garbled

We know that a computer stores nothing but streams of bytes made up of 0s and 1s. Numbers alone do not meet our needs; we also need to process text. But since computers only recognize numbers, we have to tell the computer which number represents which character.

For example, suppose I decide that 0000 represents A and 0001 represents B, and tell the computer so. When I store the two characters AB, what is actually stored is 0000 0001. In effect, I have assigned each character a unique custom code.

But that is only my convention, and different people have different ideas. Say Xiao Ming prefers 1000 for A and 1111 for B. His computer stores AB according to his scheme, as 1000 1111. When that is transmitted to my computer, I receive 1000 1111 and decode it with my own table, where it might mean %&. That is garbled text.

So the essence of garbled text is that the encoding and the decoding do not match.
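The Xiao Ming scenario can be sketched in a few lines of Python. The two tables below are made up for illustration; in particular, the assumption that my table maps 1000 to % and 1111 to & only mirrors the "it might mean %&" guess above.

```python
# Hypothetical 4-bit code tables, invented purely for illustration.
MY_TABLE = {"A": "0000", "B": "0001", "%": "1000", "&": "1111"}
XIAO_MING_TABLE = {"A": "1000", "B": "1111"}


def encode(text, table):
    """Turn characters into a space-separated bit string using `table`."""
    return " ".join(table[ch] for ch in text)


def decode(bits, table):
    """Turn a bit string back into characters using the reverse of `table`."""
    reverse = {code: ch for ch, code in table.items()}
    # Codes that the table does not know are shown as "?".
    return "".join(reverse.get(code, "?") for code in bits.split())


sent = encode("AB", XIAO_MING_TABLE)       # what Xiao Ming's computer stores
garbled = decode(sent, MY_TABLE)           # read back with my table
restored = decode(sent, XIAO_MING_TABLE)   # read back with his table

print(sent)      # 1000 1111
print(garbled)   # %&
print(restored)  # AB
```

The bits never change in transit. Only the table used to read them differs, and that alone is enough to turn AB into %&.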

Some readers may not be familiar with the concepts of encoding and decoding, so let me explain:

  • Encoding: the process of converting characters into a byte stream according to a certain format.
  • Decoding: the reverse process of parsing a byte stream back into characters.
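As a minimal sketch of these two definitions, Python's built-in str.encode and bytes.decode perform exactly these two steps. The sample string is my own choice, and errors="replace" is used only so that the mismatched decode cannot raise an exception.

```python
text = "中文"

data = text.encode("utf-8")                    # encoding: characters -> bytes
same = data.decode("utf-8")                    # decoding with the matching codec
wrong = data.decode("gbk", errors="replace")   # decoding with a mismatched codec

print(data.hex(" "))   # the raw bytes, in hexadecimal
print(same)            # the original text
print(wrong)           # garbled text
```

Decoding with the codec that was used for encoding gives back the original text. Decoding the same bytes as GBK produces different characters.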

As you can see, if everyone encodes however they like, computers cannot parse each other's data correctly. There needs to be a standard that everyone follows to define the correspondence between characters and numbers.

Standard character encoding

The American National Standards Institute (ANSI) developed such a standard, the American Standard Code for Information Interchange (ASCII), which defines a set of commonly used characters and the number assigned to each. For example, 65 means A.

ASCII is actually a 7-bit encoding. In binary it covers 0000000 to 1111111, which is 128 characters, but since 1 byte is 8 bits, each character is generally stored in 8 bits. It is really an American encoding: even the United Kingdom, which also speaks English, is left out, because ASCII has no pound sign (£).
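These ASCII facts are easy to check with Python's built-in "ascii" codec. This is only a small sketch and the variable names are my own.

```python
code = ord("A")                      # the number ASCII assigns to A
seven_bits = format(code, "07b")     # the same number written in 7 bits
stored = "A".encode("ascii")         # stored as one 8-bit byte

try:
    "£".encode("ascii")
    pound_in_ascii = True
except UnicodeEncodeError:
    pound_in_ascii = False           # the pound sign has no ASCII code

print(code)            # 65
print(seven_bits)      # 1000001
print(len(stored))     # 1
print(pound_in_ascii)  # False
```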

Korean, Japanese, and so on are not covered either, let alone Chinese.

1 byte can represent at most 256 characters, which is far from enough for us, so the encoding had to be extended. For example, GB2312 is the "Chinese Character Coded Character Set for Information Exchange" issued by China's State Administration of Standards. GBK was released later; the K stands for kuozhan, meaning extension. It builds on GB2312 and adds many more characters, such as traditional characters.
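The relationship between GB2312 and GBK can be probed with Python's "gb2312" and "gbk" codecs. The characters 龙 (simplified) and 龍 (traditional) are my own picks for this sketch, and can_encode is a hypothetical helper.

```python
def can_encode(ch, codec):
    """Return True if `ch` has a code in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified, traditional = "龙", "龍"

print(can_encode(simplified, "gb2312"))    # True: a common simplified character
print(can_encode(traditional, "gb2312"))   # False: not part of GB2312
print(can_encode(traditional, "gbk"))      # True: added by the GBK extension

# Characters shared by both standards keep the same two bytes in GBK.
print(simplified.encode("gb2312") == simplified.encode("gbk"))
print(len(traditional.encode("gbk")))      # 2 bytes, beyond the 1-byte limit
```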

So each country ended up with its own standard, because the languages differ. These differences between character sets made exchanging documents between computers very difficult, so a wave of standardization began.

For example, the name "ANSI encoding", associated with the ANSI organization in the United States, does not refer to one fixed encoding; in practice it means the platform's default encoding. A Chinese operating system uses GBK, a US English one uses an ASCII-based encoding, and operating systems come with these standard character sets pre-installed.

But this only handles the case of one character encoding per document. Suppose my document contains Japanese, French, German, Russian, and Chinese all at once. What then?

Unicode

So Unicode was created. Its Chinese names translate as Unified Code, Universal Code, and Single Code.

The Unicode character set aims to cover all the characters humans currently use. Each character is numbered under one uniform scheme and assigned a unique character code. Someone had to take the lead on this kind of thing, otherwise there would never be uniformity.
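A rough sketch of what this buys us follows. The mixed Chinese and Korean sample string is my own choice (Korean because GBK has no Hangul), and can_encode is a hypothetical helper. Neither single-language codec can hold the whole string, yet every character has a Unicode number and UTF-8 can encode all of them.

```python
text = "中文 한국어"   # hypothetical sample mixing Chinese and Korean


def can_encode(s, codec):
    """Return True if every character of `s` has a code in the codec."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Every character has one unique Unicode number (code point).
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)

print(can_encode(text, "latin-1"))  # False: no Chinese, no Korean
print(can_encode(text, "gbk"))      # False: Chinese is covered, Hangul is not
print(can_encode(text, "utf-8"))    # True: a Unicode encoding covers both
```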

Let me explain a few terms to make things clearer.

  • Character: a single English letter or a single Chinese character, for example, is called a character.
  • Character set: a collection of characters, each paired with a number.
  • Character code: the number assigned to a character in a character set. For example, in the ASCII character set, the character code of A is 65.
  • Character encoding: the concrete rule that turns characters into a byte stream, based on the character-to-number mapping in the character set.
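The four terms map onto Python built-ins fairly directly. This is a small illustrative sketch, not a formal definition: ord and chr move between a character and its character code, while encode applies a character encoding.

```python
ch = "A"                        # a character
code = ord(ch)                  # its character code in ASCII (and Unicode)
back = chr(code)                # from the character code back to the character
stream = ch.encode("ascii")     # character encoding: character -> byte stream

print(code)           # 65
print(back)           # A
print(list(stream))   # [65]
```

For ASCII the character code and the stored byte are the same number, which is why the two ideas are easy to confuse.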

Unicode differs from the earlier encodings in one respect: it decouples the character set from the encoding.

In earlier encodings such as ASCII and GBK, the character set and the encoding implementation are tied together. You can think of such an encoding as a lookup table: a fixed table stores each character together with its fixed binary sequence. For example, the number for A is 65, and its binary sequence is 01000001.

Unicode is different. It separates the character set from the character encoding. The number for A is still 65, but the binary sequence is not fixed; it depends on the specific character encoding. In UTF-8 it is 01000001, while in UTF-16 (big endian) it is 00000000 01000001.
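The binary sequences for A can be reproduced with Python's codecs. The bits helper is hypothetical and only formats bytes for display; UTF-32 is added as one more Unicode encoding for comparison.

```python
def bits(data):
    """Format bytes as space-separated groups of 8 binary digits."""
    return " ".join(format(b, "08b") for b in data)


code = ord("A")                           # one character code: 65
utf8 = bits("A".encode("utf-8"))
utf16 = bits("A".encode("utf-16-be"))     # big endian, no byte order mark
utf32 = bits("A".encode("utf-32-be"))

print(code)    # 65
print(utf8)    # 01000001
print(utf16)   # 00000000 01000001
print(utf32)   # 00000000 00000000 00000000 01000001
```

One character code, three different byte sequences: the number belongs to the character set, the bytes belong to the encoding.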

This is actually why we now mostly use UTF-8 rather than UTF-16. UTF-16 uses at least two bytes per character, so it stores ASCII-heavy text inefficiently, and many C functions treat a 0x00 byte as the string terminator, which UTF-16 data is full of. So UTF-8 was created. It is variable length, using 1 to 4 bytes per character. I will not go into the encoding rules here; you can look them up.
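Both points, the variable length and the 0x00 bytes, can be observed without knowing the encoding rules. The four sample characters are my own picks, one for each UTF-8 length.

```python
samples = ["A", "é", "中", "😀"]

# Bytes needed per character in UTF-8 and in UTF-16 (big endian).
utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
utf16_sizes = [len(ch.encode("utf-16-be")) for ch in samples]

print(utf8_sizes)    # [1, 2, 3, 4]
print(utf16_sizes)   # [2, 2, 2, 4]

# ASCII text in UTF-16 contains 0x00 bytes, which C string functions
# would treat as the end of the string; UTF-8 has no such bytes here.
zero_in_utf16 = b"\x00" in "A".encode("utf-16-be")
zero_in_utf8 = b"\x00" in "A".encode("utf-8")

print(zero_in_utf16)  # True
print(zero_in_utf8)   # False
```

Note that for a character like 中, UTF-16 is actually the shorter of the two, so the storage argument applies mainly to text dominated by ASCII characters.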

In closing

By now we have clarified where garbled text comes from, and why there are so many character encodings. There are many languages, and ASCII came first but was not enough for other countries, so each of them extended it separately.

But as encodings multiplied, uniformity and compatibility between countries became hard to achieve. So an international organization later defined the Unicode character set, which unifies all characters and separates the character set from the encoding, making encoding more flexible.

By the way, English rarely comes out garbled because most character sets are extensions of ASCII, so they are compatible with ASCII.
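This ASCII compatibility is simple to verify. The codec list below is just a sample of ASCII-based encodings that I picked for the sketch.

```python
text = "Hello"
codecs_to_try = ["ascii", "latin-1", "gbk", "utf-8"]

# English text encodes to the same bytes in every ASCII-based encoding.
encoded = {name: text.encode(name) for name in codecs_to_try}
all_same = len(set(encoded.values())) == 1

# So even a "wrong" decoder still reads it correctly.
cross_decoded = text.encode("utf-8").decode("gbk")

print(all_same)        # True
print(cross_decoded)   # Hello
```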

This issue is more of a fun popular-science piece, but I am still eager for your likes, haha.

Talk

Ao Bing has compiled his interview articles into a 1,630-page e-book!

It is packed with substance, and every word counts. It comes together with the interview questions and resume templates I put together while reviewing, and all of it is now free for everyone.

Link: https://pan.baidu.com/s/1ZQEKJBgtYle3v-1LimcSwg Password:wjk6

This is Ao Bing. The more you know, the more you realize you don't know. Thank you all for your likes, favorites, and comments. See you in the next issue!




Origin blog.csdn.net/qq_35190492/article/details/109091892