If an interviewer asks you about the difference between Unicode and UTF-8, this is how to answer

Reference for this article: Zhihu, combined with my own development experience.


Background: A few days ago, a junior colleague came over and mentioned that he had been asked in an interview about the difference between Unicode and UTF-8. As far as we knew, the encoding was simply UTF-8, so I searched Baidu and am leaving this record to clarify how the encodings relate to one another.

Why are encoding issues confusing at first?

When I started learning the basics of Java, the video course only briefly mentioned the GBK encoding (a set of encodings defined in China). Later, when I learned J2EE, I went straight to UTF-8. Since then it has been the same everywhere: the browser, the HTTP protocol, the compiler, and the underlying database. Whenever garbled characters appear, I know it is an encoding problem, and it can usually be solved by changing the configuration to UTF-8. Once the problem is solved, there is no incentive to look further. So in day-to-day development I knew what to do about encoding problems, but not why it worked.

So what is encoding?

I first came to understand encoding through the book "In-depth Understanding of Computer Systems". In the computer world there are only 1s and 0s. All data, all logic, and all instructions are expressed as combinations of 1 and 0, and character encoding is no exception.
A single 1 or 0 is called a bit in computer terms. Two bits can form 4 combinations: 00, 01, 10, 11. Because a single bit carries too little information to be useful on its own, people gradually settled on a group of 8 bits as the basic unit, called a byte. Bytes, or combinations of bytes, are then used to describe information: for example, 0000 0001 means XXX, 0000 0010 means YYYY, and so on.
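The counting above can be checked with a short Python sketch. The helper name `bit_patterns` is made up for this illustration; it lists every pattern of a given width, so two bits give 4 patterns and eight bits (one byte) give 256.

```python
from itertools import product

def bit_patterns(n):
    """Return every n-bit pattern as a string of 0s and 1s, in counting order.

    Hypothetical helper for illustration only.
    """
    return ["".join(bits) for bits in product("01", repeat=n)]

two_bit = bit_patterns(2)
print(two_bit)                # ['00', '01', '10', '11']
print(len(bit_patterns(8)))   # 256 distinct values fit in one byte

# The number 2 written as one byte is 00000010: binary digits are only 0 and 1.
print(format(2, "08b"))       # 00000010
```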

Computers were originally developed by Americans, who use the Latin alphabet plus some special symbols: 128 characters in total, numbered 0 to 127. One byte can represent 2^8 = 256 different values, so one byte was clearly enough for the characters Americans used. What was still needed was a dictionary stating which byte value stands for which character. This is where the ASCII table comes in. It specifies, for example:
0x0A (0000 1010), the terminal starts a new line;
0x07 (0000 0111), the terminal beeps at the user;
0x1B (0001 1011), the escape code, which starts a control sequence that can make a printer print in reverse or a terminal display letters in color.

Because a byte is eight bits, and writing out a pattern such as 0101 0000 every time is tedious, people are used to describing bytes in hexadecimal, where two hex digits cover exactly one byte.
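As a small illustration of the ASCII table and of hexadecimal notation, the Python sketch below prints a few codes in both hex and binary. The helper name `describe` is hypothetical. Note that in standard ASCII the line feed is 0x0A (decimal 10), the bell is 0x07, and escape is 0x1B.

```python
def describe(ch):
    """Format one character's ASCII code as hex and as 8 binary digits.

    Hypothetical helper for illustration only.
    """
    code = ord(ch)
    return f"0x{code:02X} = {code:08b}"

# Python escape sequences for three ASCII control characters.
controls = {"line feed": "\n", "bell": "\a", "escape": "\x1b"}
for name, ch in controls.items():
    print(name, describe(ch))

# A printable character: the letter A is code 65, which is 0x41 in hex.
print("A", describe("A"))
print("A".encode("ascii"))   # b'A', a single byte
```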

As time went on, more and more characters needed to be recorded, and 128 were clearly not enough. So the original ASCII was extended to use the remaining values of the byte as well, giving 256 characters.

The introduction of computers into China

Chinese has thousands of characters in common use, and the 256 values of a single byte are nowhere near enough to describe them. Nor could everyone in China be expected to learn English just to use a computer. So the experts added one more byte on top of the original ASCII and described Chinese characters with a combination of two bytes (at most 2^16 = 65536 values). This new encoding table, "GB2312", allowed Chinese characters to be represented in computer systems.

The first version, "GB2312", covered only the more commonly used characters. As in the United States, computer applications grew more complex and the existing encoding could no longer meet actual needs, so "GBK" was defined as an extension of "GB2312". The character set was later expanded again, into "GB18030", so that the scripts of China's ethnic minorities could also be represented.
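The progression from GB2312 to GBK to GB18030 can be observed with Python's built-in codecs of the same names. This is a sketch under a few assumptions: the helper `try_encode` is made up for this example, and the sample characters (the traditional character 國 and the Tibetan letter ཀ) are my own picks, not taken from the article.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec has no entry for the text.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

CODECS = ("gb2312", "gbk", "gb18030")

# A common simplified character: two bytes in all three tables.
common = {codec: try_encode("中", codec) for codec in CODECS}
print(common)

# A traditional character: missing from GB2312, present from GBK onward.
traditional = {codec: try_encode("國", codec) for codec in CODECS}
print(traditional)

# A Tibetan letter: only GB18030 can represent it, using a four-byte form.
tibetan = {codec: try_encode("ཀ", codec) for codec in CODECS}
print(tibetan)

# ASCII text stays one byte per character, so the tables remain ASCII-compatible.
print(try_encode("A", "gbk"))   # b'A'
```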

Computer globalization

Once the Internet spread across the whole world, every country had its own set of encodings. For example (a simplified example, not real values), in China the two bytes 0xbb12 might represent the character for "I", but when decoded in India they turn into garbled text. Likewise, a Chinese computer using "GBK" cannot correctly enter Indian scripts. To cope with globalization, international standards bodies such as ISO stepped in, with the goal of unifying computer character encoding across all countries.
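A hedged Python sketch of how such garbling arises: the same bytes are written using one table and read using another. The `latin-1` codec is only a stand-in for "some other country's table" and is my assumption, not something the article names.

```python
original = "我"                      # the Chinese character for "I"

raw = original.encode("gbk")         # bytes as a GBK system would store them
wrong = raw.decode("latin-1")        # a reader that assumes a different table
right = raw.decode("gbk")            # a reader that assumes the same table

print(raw.hex())                     # the two stored bytes, in hex
print(repr(wrong))                   # two unrelated Western European characters
print(right == original)             # True: same table in, same text out
```

The bytes themselves never change. Only the table used to interpret them differs, which is why switching the configured encoding is usually enough to fix garbled output.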

The solution that emerged was Unicode. Strictly speaking, Unicode is a character set: it assigns every character of every country a unique number, called a code point, and does not by itself dictate how that number is stored as bytes. The earliest way of storing it was similar to GBK: every character, without exception, is represented by two bytes, so an English letter and a Chinese (full-width) character each take two bytes. Two bytes later turned out to be too few, and Unicode now defines far more than 65536 code points.
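A minimal Python sketch of the two-byte form described above. It assumes the built-in `utf-16-be` codec as a stand-in for a fixed two-byte scheme (the two agree for characters whose Unicode number fits in 16 bits). `ord` returns the Unicode number of a character, and the helper `two_byte_form` is hypothetical.

```python
def two_byte_form(ch):
    """Return (Unicode number, big-endian UTF-16 bytes) for one character.

    Hypothetical helper for illustration only.
    """
    return ord(ch), ch.encode("utf-16-be")

for ch in ("A", "中"):
    number, data = two_byte_form(ch)
    print(f"{ch}: U+{number:04X} -> {data.hex()} ({len(data)} bytes)")
# A: U+0041 -> 0041 (2 bytes)   the first byte is all zeros
# 中: U+4E2D -> 4e2d (2 bytes)

# A character numbered above 0xFFFF no longer fits in two bytes.
number, data = two_byte_form("😀")
print(f"U+{number:X} -> {len(data)} bytes")   # U+1F600 -> 4 bytes
```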

In network transmission, shorter content means less traffic, and this exposes a problem with storing every Unicode character in two bytes. If a large piece of data consists entirely of English characters, the first of every two bytes is 0000 0000, which wastes a great deal of space and bandwidth. UTF-8 was introduced to address this. It does not replace Unicode; it is a way of encoding Unicode code points as bytes. UTF-8 has no fixed number of bytes per character: some characters take one byte, some two, some three, and some four. Note that most Chinese characters take three bytes in UTF-8, unlike the two bytes of the traditional encodings.

How UTF-8 decides the number of bytes involves a small algorithm that I will not go into here. The advantage of UTF-8 is that it uses space efficiently and reduces the amount of data sent over the network. (Of course, if the transmitted content were entirely Chinese, GBK would take less space, but how often is that the case?)

So the short answer to the interview question: Unicode is the character set that gives every character a number, and UTF-8 is one variable-length way of storing those numbers as bytes.
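For readers who do want to see the algorithm, here is a minimal Python sketch of the standard UTF-8 bit layout. The function names `utf8_encode` and `sizes` are made up for this example, and the sketch skips validation (it does not reject surrogates or numbers above 0x10FFFF). In practice the built-in `str.encode("utf-8")` should be used; it appears here only as a cross-check.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes by hand (no validation)."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx (same as ASCII)
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in ("A", "é", "中", "😀"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # cross-check against the built-in codec
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")

def sizes(text):
    """Byte length of the same text under three encodings."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-be", "gbk")}

print(sizes("hello"))   # {'utf-8': 5, 'utf-16-be': 10, 'gbk': 5}
print(sizes("中文"))     # {'utf-8': 6, 'utf-16-be': 4, 'gbk': 4}
```

The size comparison matches the trade-off in the text: English costs one byte per character in UTF-8 and two in a fixed two-byte scheme, while Chinese costs three bytes in UTF-8 and two in GBK.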
