Understanding character encoding in one article

foreword

Speaking of character encoding, I am reminded of a scene in the sci-fi masterpiece "Three-Body" series, where humans encounter the alien object known as the "Ring".

For the first time, humans saw the four-dimensional object, the "Ring", at close range.

Zhuo Wen sent a greeting using medium-frequency radio waves: a simple dot-matrix diagram in which six rows of dots, each with a different count, formed a prime-number sequence: 1, 3, 5, 7, 11, 13.

They did not expect an answer, but one came right away.

The spaceship received a series of dot-matrix images from the "Ring". The first was a very neat 8×8 matrix, sixty-four dots in all; in the second, one corner of the matrix was missing a dot, leaving sixty-three; the third had one fewer still, sixty-two... "This is a countdown, and it also works as a progress bar. It probably means 'it' has received the Rosetta and is deciphering it, and asks us to wait," said West.

"But why sixty-four?"

"When using binary, it is a moderate number, which is about the same as one hundred in decimal."

Both Zhuo Wen and Guan Yifan felt very fortunate to have brought West along; a psychologist really does have a talent for establishing communication with unknown intelligences.

When the countdown reached fifty-seven, something exciting happened: the next count was no longer shown as a dot matrix. The picture sent by the "Ring" displayed the human Arabic numeral 56!

Even in humanity's journey to explore alien civilizations and march toward the stars and the sea, this most basic problem of encoding is unavoidable. A while ago a colleague and I ran into a garbled-text problem, and after looking into it I found the issue had been there for two years. We deal with encodings every day, yet most of us have only a half-understanding of character encoding — we "eat pork every day but have never seen a pig run". Today, let's explain it thoroughly.

What is character encoding

We know that the computer world contains only 0s and 1s. Without character encoding, all we would see is a string like "110010100101100111001...". Communicating with the machine would be like playing the lute to a cow: I cannot understand it, and it cannot understand me. Character encoding is like a translator between humans and machines: it translates the characters we know into binary the machine understands, and translates binary back into characters we can read.

The following is Wikipedia's explanation of character encoding

Character encoding, also known as character set encoding, maps the characters of a character set to objects in some specified set (for example: bit patterns, sequences of natural numbers, octets, or electrical pulses), so that text can be stored in computers and transmitted over communication networks. Common examples are the encoding of the Latin alphabet into Morse code and into ASCII. ASCII, for instance, numbers letters, digits and other symbols and represents each number with 7 bits of binary.

A character set is a collection of characters. There are many kinds of character sets, each containing a different number of characters; common ones include the ASCII, GB2312, BIG5, GB18030 and Unicode character sets. For a computer to process text from these character sets accurately, character encoding is needed so that the computer can recognize and store the text.

Why Computers Need Encoding

Encoding (Encode) is the process of converting information from one form to another, such as converting characters (words, numbers, symbols, etc.), images, sounds or other objects into prescribed electrical pulse signals or binary digits.

The pictures we see, the music we hear, even the lines of code we write and the characters we type — everything we see and hear feels so real, yet behind it all there is nothing but a series of "0" and "1" digits. The girl you saw on your phone yesterday does not exist in the physical world; she is just an image the computer rendered for you out of "0"s and "1"s.


Binary doesn't really exist

You may think the data in a computer literally is binary "0"s and "1"s, but in fact there is no binary inside the machine. We know all the content is stored on the hard disk, yet if you take the disk apart you will not find any "0101" in it — only platters and tracks. Zoom in on a platter and all you see is an uneven surface: the raised spots are magnetized and the recessed spots are not. We simply gave the magnetized spots the name "1" and the unmagnetized spots the name "0".

Similarly, you will not find binary digits in memory. Under magnification, memory looks like a bank of capacitors; whether a cell stores "0" or "1" depends on whether its capacitor holds a charge. If it holds a charge we call it "1", otherwise "0". But capacitors leak: over time a capacitor representing "1" discharges. That is why this kind of memory cannot survive a power cut — the capacitors must be refreshed (recharged) periodically so that a "1" stays a "1".

Then there is the display, the thing we experience most directly. The beauties, the sun, the moon, the mountains and rivers you see on a screen are really just points of light of different colors and intensities. A display is a matrix of light-emitting elements, each of which can be called a pixel: "1" means on, "0" means off, and the rich colors we see come from combining elements of three colors (red, green and blue). So how does the ASCII code "65" end up displayed as "A"? That is the job of the graphics card: it stores the glyph data (also called the font code) of each character and sends that two-dimensional matrix of dots to the display for imaging.


So the so-called 0 and 1 are really electrical signals, and binary is a mathematical abstraction we invented. Why, then, do we use binary at all?

Because binary has only two states, each binary digit can be represented by any physical device with two stable states, one state standing for "0" and the other for "1". This makes it easy for computers to implement logical operations and logical decisions.

computer code conversion process

Because the computer can only represent the logical concept of "01" and cannot directly represent pictures and text, we need a certain conversion process.

In essence, we maintain a character-to-number mapping according to certain rules. For example, we could represent "A" as "1" inside the computer; whenever we see 1, we treat it as "A". This is simply a mapping table, and in theory you could assign each character any unique number (its character code) you like, as in the following table:

character    number
你 (you)     1 (00000001)
好 (good)    2 (00000010)

Next, let's walk through the simple path a piece of text takes from input, through encoding and storage, to output (display or printing). First of all, computers were invented by Americans, the rules were set by Americans, and the keys on the keyboard are English letters, so numbers are not in fact assigned arbitrarily. For English letters there is a direct correspondence between the keyboard and the ASCII code: the key "A" corresponds to the number "65", which is stored on disk literally as the binary of 65, "01000001". That part is easy to understand.

Chinese character input is different. There are no keys on the keyboard for Chinese characters, so we cannot type them directly. Hence the chain of input code, internal code and font code: the input code converts sequences of keystrokes into Chinese characters, the internal code converts Chinese characters into binary sequences, and the font code turns those binary sequences into glyphs on the monitor.


input code

Let's simulate the process of typing Chinese characters. Open a txt file and type the pinyin letters "nihao"; the input method pops up a list of candidate Chinese phrases, and we pick one by its number to complete the input. How is this achieved?

In computing there is a saying almost as sacred as the Ten Commandments: "Any problem in computer science can be solved by adding another layer of indirection."

Here we add another mapping layer between letter-key combinations and Chinese characters, rather like an English-Chinese dictionary; this layer is the input code. Going from input code to internal code is a table-lookup conversion on the ASCII characters "nihao", and you can modify the mapping table and the candidate order at will — I could even map it to an entirely different phrase.


internal code

The internal code, also known as the machine code, is the core part of character encoding. It is the binary code actually used to store, exchange and transmit the characters of a character set inside the computer, and it is what lets us store and transmit text efficiently. The external code (input code) maps keyboard keys to characters, but it is the internal code that turns characters into a binary language the machine can understand.

font code

Inside the computer, characters exist only as the binary internal code. So how do we display the character corresponding to a number on the monitor? For example, if the number "1" represents the Chinese character "你", how is "1" displayed as "你"?

This relies on the font code, which is essentially an n*n pixel matrix: pixels in some positions are set to white (represented by 1) and the rest to black (represented by 0). The glyph of every character is stored in the computer in advance, and such a collection of glyph data is called a font library.

For example, the dot matrix of the Chinese character "你" is a 16*16 pixel matrix, which needs 16 * 16 / 8 = 32 bytes to represent; this glyph data is the font code. Different typefaces (such as SongTi, HeiTi) have different font codes for the same character.


Therefore, going from the character code to the displayed glyph is yet another table lookup: the mapping table between character codes and font codes.

In fact, you can even think of character encoding as a kind of compression of font codes: a 32-byte pixel matrix is compressed into a 2-byte internal code.
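To make the idea concrete, here is a minimal C sketch, not tied to any real font library: the 8x8 bitmap below is a hypothetical glyph for the letter "A" (a real 16*16 SongTi glyph would take 32 bytes, as described above), and printing it shows that a font code is nothing more than a matrix of on/off dots.

#include <stdio.h>
#include <stdint.h>

/* A hypothetical 8x8 glyph for 'A' (illustrative only; real font libraries
 * store larger bitmaps per character). Each byte is one row, each bit one pixel. */
static const uint8_t glyph_A[8] = {
    0x18, /* 00011000 */
    0x24, /* 00100100 */
    0x42, /* 01000010 */
    0x42, /* 01000010 */
    0x7E, /* 01111110 */
    0x42, /* 01000010 */
    0x42, /* 01000010 */
    0x00  /* 00000000 */
};

int main(void)
{
    /* Walk the matrix: a 1 bit becomes a visible dot, a 0 bit a blank. */
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++) {
            putchar((glyph_A[row] >> (7 - col)) & 1 ? '#' : '.');
        }
        putchar('\n');
    }
    return 0;
}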


History of character encodings

telegraph code

In a broad sense, encoding has a long history, going back to ancient record-keeping with knotted cords. But it was the invention of Morse code, the closest ancestor of modern character encoding, that opened the door to the era of information and communication.

Morse code was invented by the American Samuel Morse in 1837, more than 100 years before ASCII. It played a very important role in early radio and was required knowledge for every radio operator. It is built from two symbols, the dot "." and the dash "-", heard in a telegraph as a short click and a long click — in effect a binary code. A single symbol is obviously not enough to represent all the letters, so multiple symbols are combined: ".-" represents the letter "A" and "-..." represents the letter "B".

[Figure: Morse code table]

coding era

When the computer was first invented it was used to solve mathematical calculation problems; later, people discovered it could do much more, such as text processing. Before machine-to-machine communication was a concern, each manufacturer did its own thing: it built its own hardware, wrote its own software, and encoded characters however it pleased.

Later, when machines needed to talk to each other, people found that the same bytes displayed different characters on different computers: the number "00010100" might mean "A" on an IBM system but show up as "B" on a Microsoft system. Everyone was dumbfounded. So the American standards body stepped in and formulated ASCII (American Standard Code for Information Interchange), which unified the rules of the game by stipulating which binary numbers represent which common symbols.

Let a hundred flowers bloom


The unified ASCII standard made English-speaking countries happy, but ASCII only considers English letters. When computers spread to Europe, the French needed extra letters (such as é) and the Germans needed a few more (Ä ä, Ö ö, Ü ü, ß). Fortunately ASCII had only used the first 128 numbers, so the Europeans assigned the unused codes (128-255) to their own symbols, and everyone could still play together reasonably well.

But when computers reached China, this scheme was hopeless for a language as vast and profound as Chinese: with tens of thousands of Chinese characters, 255 numbers are nowhere near enough, which is why multi-byte encodings appeared later. Country after country introduced code tables for its own language, giving us the ISO 8859 series, the GB series (GB2312, GBK, GB18030, GB13000), Big5, EUC-KR, JIS... However, in order to coexist in computer systems, these extended encodings are all directly or indirectly compatible with ASCII.

To sell their products worldwide, international manufacturers such as Microsoft and IBM had to support the languages of many countries and adopt the local encoding in each region, so they collected the world's encodings and numbered them, calling each one a code page (Codepage, also known as an internal code table). That is why a character encoding is sometimes referred to by its code page number: in Microsoft systems, for example, the Simplified Chinese GBK encoding corresponds to code page 936 and the Traditional Chinese Big5 encoding to code page 950.

These character encodings — each compatible with ASCII but incompatible with one another — were later collectively referred to as ANSI encodings. You have probably seen this on Windows, where text files are saved with the "ANSI" encoding by default.


Literally, ANSI does not refer to a character encoding at all; it is the abbreviation of the American National Standards Institute, a non-profit organization in the United States that did a great deal of standards work on character encoding. Somewhat confusingly, these multi-byte encodings came to be called ANSI encodings, or standard code pages.

"ANSI encoding" is only a generic name for the system's default encoding; it is not one specific encoding. In the Windows operating system, for example, ANSI means GB encoding in mainland China, Big5 in Hong Kong, and EUC-KR in Korea.

unification of the world

Since every country had its own character encoding, what if someone writing in Chinese wanted to show off with a couple of sentences in Korean? Sorry, that is too fancy to support: once you choose GB2312 you can only type Chinese characters. Meanwhile the major international manufacturers were also suffering from having to support so many incompatible encodings, so they decided to develop one encoding that could hold all the characters in the world — and thus the famous Unicode was born.

Unicode, also called the Universal Code, covers a character set, encoding schemes and more. It was created to overcome the limitations of traditional character encodings: it assigns a single, unique code to every character in every language, so there are no more encoding conflicts between languages, and text in any language can be displayed on the same screen. That is Unicode's biggest benefit.

Unicode has several common encoding implementations — UTF-7, UTF-8, UTF-16 and UTF-32 — of which UTF-8 is the best known. Unicode was originally designed around the fixed-length double-byte UTF-16, but the historical burden of existing ASCII data proved too heavy to push aside, and in the end the variable-length UTF-8 is what gained wide acceptance.

character encoding model

traditional coding model


In the traditional character encoding model, the characters of the character set are simply numbered one by one in decimal, and that decimal number is converted directly into the corresponding binary code. In other words, the character's number is its encoding.

When the computer converts between characters and numbers, it is really just looking up a mapping table. ASCII, for example, assigns each English character a unique number; the whole process is simple, the character maps directly to binary, and the decimal number exists only for our convenience.

character    decimal    binary
A            65         01000001
B            66         01000010
C            67         01000011

modern coding model


The Unicode encoding model takes a new approach and divides encoding into 4 layers (some say 5, but the fifth is a transport-layer adaptation and strictly speaking does not quite belong in the encoding model).

  • The first layer, the abstract character set ACR (Abstract Character Repertoire): defines the set of abstract characters and what each abstract character is;
  • The second layer, the numbered character set CCS (Coded Character Set): assigns a number to each abstract character;
  • The third layer, the character encoding method CEF (Character Encoding Form): encodes the character numbers into sequences of logical code units;
  • The fourth layer, the character encoding scheme CES (Character Encoding Scheme): encodes the logical code-unit sequences into physical byte sequences.

The first layer: abstract character set ACR

The so-called abstract character set is an unordered collection of abstract characters. "Abstract" is emphasized because it covers not only the characters we can see, such as the visible character "a", but also invisible ones, such as the control characters "DELETE", "NULL" and so on.

Another aspect of abstraction is that some glyphs are composed of multiple characters. The Spanish "ñ", for example, can be seen as the two characters "n" and "~" combined. Here Unicode differs from traditional encoding standards: a traditional standard typically treats ñ as a single character, while Unicode can also treat it as a combination of two characters.

At the same time, one character may have many visual glyph representations. A Chinese character, for example, can be written in kaishu, xingshu, caoshu or lishu script, yet all of these are regarded as the same abstract character (that is, the character set encodes characters, not glyphs); how it is displayed is a matter for the font library.

[Figure: different script forms of the Chinese character "人"]

Abstract character sets can be open or closed. An open character set keeps adding characters; a closed one does not. ASCII, for example, is closed: it has 128 characters and will never gain more. Unicode is open and keeps adding characters; it has grown from 7,163 characters initially to 144,697 today.

The second layer: numbered character set CCS

The numbered character set assigns a number to each character in the abstract character set, mapping the set onto non-negative integers.

Numbers are generally expressed in decimal and hexadecimal notations that are convenient for human reading. For example, the character number of "A" is "65", and the character number of "B" is "66";

Be clear about one thing: for some encodings the character number is exactly the binary sequence that gets stored, as in ASCII; for others the number and the stored binary sequence differ, as in GB2312, Unicode and so on.

The numbered character set also has a bounded range: ASCII covers 0-127, ISO-8859-1 covers 0-255, GB2312 uses a 94*94 two-dimensional matrix, and Unicode uses the concept of planes. This range is called the numbering space of the character set.

A position in the numbering space is called a code point. The coordinates of the code point a character occupies (a pair of non-negative integers), or the single non-negative integer it represents, is that character's code point value (code point number).

[Figure: ASCII code point numbers]

The third layer: character encoding method CEF

The abstract character set and the numbered character set exist for our human understanding; in the end they must be translated into something the computer understands, i.e. the decimal numbers must be converted into binary form.

Therefore, the character encoding method is the process of converting the code point number of the character set into a binary code unit sequence (Code Unit Sequence).

Code unit: the smallest processing unit of a character encoding. In ASCII one character is one byte, so its code unit is a single byte. In UTF-16 one code unit is two bytes and processing is done "word" by word, so it is a double-byte code unit. UTF-8 is a multi-byte encoding with both single-byte and multi-byte characters, but processing proceeds byte by byte, so its code unit is also a single byte.

You may wonder: why not simply convert the decimal number straight to binary? Why pull out a separate layer?

Early character encodings did work exactly that way: the decimal number was converted directly to binary. In ASCII the character "A" has decimal number 65, the corresponding binary is "1000001", and that is also exactly what is stored on disk, so encoding was simple back then.

With the appearance of multi-byte character sets (MBCS, Multi-Byte Character Set), the character number is no longer converted to binary directly. In GB2312, for example, the area-position number of "万" is "45, 82", while the binary internal code actually stored is "1100 1101 1111 0010" (decimal "205, 242").

What would happen if there were no such conversion and the number were mapped to binary directly? The character number of "万" is "45, 82"; in ASCII, 45 is "-" and 82 is "R", so a reader could not tell whether the bytes mean the two characters "-R" or the single character "万". To avoid this kind of conflict, prefix handling is added; the detailed process is explained below.

The fourth layer: character encoding scheme CES

The character encoding scheme, also called the serialization format, maps the code-unit sequence produced by encoding the character number into a byte sequence (a byte stream), so that the encoded characters can be processed, stored and transmitted by a computer.

The character encoding method CEF is a bit like the logical design in database schema design, while the encoding scheme CES is like the physical design: it maps the code-unit sequence to physically stored binary, in a way that depends on the specific computer platform.

Again you may wonder: why would the binary code-unit sequence differ from the binary actually stored? This is mainly a consequence of the computer's byte order (big-endian vs little-endian), which is covered in detail in the UTF-16 section.

The terms big-endian and little-endian come from Jonathan Swift's "Gulliver's Travels":

Everyone agrees that the original way to eat an egg is to break it at the larger end. But the grandfather of the present emperor, as a boy, happened to cut a finger while breaking an egg the ancient way. His father, the emperor at the time, therefore issued an edict commanding all subjects to break their eggs at the smaller end, with heavy penalties for violators.

The people deeply resented this order. History tells us that six rebellions broke out over it, in which one emperor lost his life and another his throne... Hundreds of volumes have been published on the dispute, but the books of the Big-Endians have long been banned, and the law bars anyone of that faction from holding office.

common character encoding

ASCII

Long ago, every computer manufacturer had its own way of rendering characters to the screen. Computers were the size of a house and not something ordinary people owned, so nobody worried much about how machines would talk to each other. With the arrival of microprocessors in the 1970s and 1980s, computers shrank, personal computers entered the public eye and then exploded in popularity. But manufacturers had been doing their own thing and had never considered making their products compatible with anyone else's, which made converting data between different computer systems very painful. The American standards body therefore formulated the ASCII code in 1967; to date it defines a total of 128 characters.

[Figure: ASCII encoding table (each column corresponds to the high 4 bits of the byte)]

Of these, the first 32 (0-31) are invisible control characters, 32-126 are visible characters, and 127 is the DELETE control code (the DEL key on the keyboard).

In fact, before ASCII, IBM had already launched its own character encoding system, EBCDIC, in 1963; like ASCII it includes control characters, digits, common punctuation, and uppercase and lowercase English letters.

[Figure: EBCDIC encoding table]

However, its letters are not numbered contiguously, which makes program processing awkward. ASCII learned from EBCDIC's lessons and assigned contiguous codes to the English letters, which made programs easier to write and helped ASCII gain wide acceptance.

Comparing ASCII with EBCDIC, besides the contiguous arrangement of the letters, ASCII's biggest advantage is that it uses only the lower 7 bits of a byte and the highest bit is always 0. Do not underestimate that leading 0: it looks insignificant, but it is the most successful part of ASCII's design. As the encoding principles introduced later will show, it is precisely this high bit of 0 that lets other encoding standards remain seamlessly compatible with ASCII, and that compatibility is what made ASCII so widely accepted.
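A tiny C sketch of that property, assuming nothing beyond standard C: every ASCII byte has its high bit clear, and testing that bit is exactly how the multi-byte encodings described later tell a plain ASCII byte apart from part of a multi-byte character.

#include <stdio.h>

/* Returns 1 if the byte is a 7-bit ASCII code (high bit 0), 0 otherwise. */
static int is_ascii(unsigned char b)
{
    return (b & 0x80) == 0;
}

int main(void)
{
    unsigned char bytes[] = { 'A', 0x7F, 0xC4, 0xE4 }; /* 0xC4/0xE4: sample non-ASCII bytes */
    for (size_t i = 0; i < sizeof bytes; i++)
        printf("0x%02X -> %s\n", bytes[i], is_ascii(bytes[i]) ? "ASCII" : "not ASCII");
    return 0;
}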

ISO-8859 series

Although the American market had a unified character encoding, computer manufacturers ran into trouble when entering the European market. The mainstream European languages also use the Latin alphabet, but with many extensions — the French "é", the extra letters used in Norwegian — that cannot be expressed in ASCII. Everyone noticed, however, that the 128 codes above ASCII were still unused, and those were enough for the mainstream European languages.

Hence the well-known ISO-8859-1 (Latin-1), which merely adds 128 characters beyond ASCII and is still a single-byte encoding. To stay compatible with ASCII, a byte with the highest bit 0 still means the original, unchanged ASCII character, while a byte with the highest bit 1 means an extended European character.


But it does not end there. As noted, Latin-1 covers only the mainstream European languages: it lacks the three letters œ, Œ and Ÿ used in French, and the Š, š, Ž, ž used in Finnish. The 256 code points of a single-byte encoding were already used up, so more variants appeared — the ISO-8859-2/3/…/16 series — all compatible with ASCII but not fully compatible with one another.

The ISO-8859-n series character set is as follows:

The ISO8859-1 character set, also known as Latin-1, is a common character in Western Europe, including the letters of Germany and France.

The ISO8859-2 character set, also known as Latin-2, collects Eastern European characters.

The ISO8859-3 character set, also known as Latin-3, collects Southern European characters.

The ISO8859-4 character set, also known as Latin-4, collects Nordic characters.

The ISO8859-5 character set, also known as Cyrillic, is a collection of Cyrillic characters.

The ISO8859-6 character set, also known as Arabic, collects characters of the Arabic family.

The ISO8859-7 character set, also known as Greek, collects Greek characters.

GB series

When computers reached East Asia, manufacturers were even more dumbfounded. The languages of the United States and Europe are basically phonetic, so one byte suffices; many Asian languages are ideographic, with tens or even hundreds of thousands of characters, so one byte is nowhere near enough. The relevant Chinese authorities therefore designed the double-byte GB2312 encoding following the ISO framework. GB2312 is a closed character set containing only some 7,000 commonly used characters, so to cover more characters, including rarer ones, GBK, GB18030 and GB13000 followed ("GB" is the pinyin abbreviation of "Guo Biao", national standard).

Under the GB family of encodings, if a byte in a piece of text is in the range 0~127, it means exactly what it means in ASCII; otherwise this byte together with the next byte forms one Chinese character (or another character defined by the GB encoding). The GB series is therefore compatible with ASCII.


GB2312

GB2312 is an encoding standard that uses two bytes for Chinese characters; it contains 6,763 Chinese characters and 682 non-Chinese graphic characters. Each of the two bytes must be greater than 127 (the high bit is 1), and two such bytes together represent one Chinese character. In practice GB2312 text is therefore variable length: an English character occupies one byte and a Chinese character occupies two, and GB2312 can be regarded as a Chinese extension of ASCII.

GB2312's numbering space is a 94*94 two-dimensional table: the row is the area (the high byte) and the column is the position (the low byte); each area has 94 positions, and each area-position pair corresponds to one character. This pair is called the area code. Add 2020H to the area code to get the national standard code; add 8080H to the national standard code to get the internal code commonly used by computers. Having introduced the area code, the national standard code and the machine internal code, let's look at how the three relate.

National standard code

The national standard code is China's standard code for Chinese character information interchange. It consists of 4 hexadecimal digits, i.e. two bytes that each use only the low 7 bits. To avoid the first 32 ASCII control characters, each byte starts counting from number 33, as shown in the figure below.

[Figure: national standard code byte ranges]

Area code

Because the hexadecimal range of the national standard code is not very intuitive or convenient, it is mapped onto a decimal 94*94 two-dimensional table; this number is the area code. The area code can even be used as an external (input) code: an input method can be switched to area-code mode to type Chinese characters directly. But this input method has no mnemonic rules at all — people can hardly remember the numbers — so few people use it.

The figure below shows the two-dimensional area-code table. The character "万", for example, sits at position 82 of area 45, so its area code is "45, 82".

[Figure: GB2312 area-code table]

The areas are allocated as follows:

  • Areas 01~09 (682 characters): special symbols, digits, English characters, tab marks, etc. — 682 full-width characters including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters;
  • Area 10~15: Empty area, reserved for expansion;
  • Areas 16~55 (3755): commonly used Chinese characters (also known as first-level Chinese characters), sorted by pinyin;
  • Areas 56~87 (3008): very common Chinese characters (also called second-level Chinese characters), sorted by radicals/strokes;
  • Areas 88~94: Empty areas, reserved for expansion.

Machine internal code

The GB2312 national standard code, as specified, overlaps the visible part of ASCII: it re-encodes English letters and symbols with two 7-bit bytes. The drawback is that early English documents encoded in plain ASCII could then not be opened — they came out garbled. In other words, the standard should have been compatible with existing ASCII rather than overlapping it. To solve this, Microsoft later set the highest bit of each byte to 1 (ASCII uses only 7 bits, so its high bit is 0); the converted encoding is called the machine internal code (internal code). This effectively amends the GB2312 standard, and it is what everyone finally accepted and used.

To summarize the conversion between the three: area code —> add 32 (i.e. +20H) to both the area and position bytes to get the national standard code —> add another 128 (i.e. +80H) to each byte to get the internal code (which no longer conflicts with ASCII).
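Here is a small C sketch of that conversion chain, using the "万" example from earlier (area 45, position 82): adding 0x20 to each byte gives the national standard code, and adding another 0x80 gives the internal code 0xCD 0xF2 (decimal 205, 242).

#include <stdio.h>

/* Convert a GB2312 area/position pair into the national standard code and
 * the machine internal code, following the +20H / +80H rule described above. */
int main(void)
{
    int area = 45, pos = 82;               /* area code of "万": 45, 82 */

    int gb_hi = area + 0x20;               /* national standard code, high byte */
    int gb_lo = pos  + 0x20;               /* national standard code, low byte  */

    int in_hi = gb_hi + 0x80;              /* internal code, high byte */
    int in_lo = gb_lo + 0x80;              /* internal code, low byte  */

    printf("area code        : %d, %d\n", area, pos);
    printf("national standard: 0x%02X 0x%02X\n", gb_hi, gb_lo);
    printf("internal code    : 0x%02X 0x%02X (%d, %d)\n", in_hi, in_lo, in_hi, in_lo);
    return 0;
}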


GBK

GBK stands for "national standard extended". GB2312 requires the high bit of both bytes to be 1, which caps it at fewer than 10,000 characters (94*94 = 8,836 code points), so GBK extends it. GB2312 characters keep their existing codes without re-encoding, so GBK is fully compatible with GB2312. GBK is still a double-byte encoding, but it only requires the first byte to be greater than 127 to mark the start of a Chinese character; because of this, GBK's encoding space is much larger than GB2312's.

GBK's overall coding range is 8140-FEFE: the first byte lies in 81-FE and the second byte in 40-FE. Excluding the xx7F column, this gives 23,940 code points, of which 21,886 Chinese characters and graphic symbols are assigned. GBK/1 holds supplementary symbols beyond GB2312, GBK/2 holds the GB2312 characters, GBK/3 holds additional CJK characters, GBK/4 holds further CJK characters and supplements, GBK/5 holds non-Chinese-character symbols, and UDC areas are for user-defined characters.

The details are as follows:

[Figure: GBK code regions (GBK/1-GBK/5 and user-defined areas)]

→ You may have two questions here: why does the trailing byte start from 40 instead of 00, and why are the FF column and the xx7F column excluded?

GBK does not force the trailing byte's high bit to be 1, so when that bit is 0 the byte overlaps with ASCII. The ASCII codes between 00 and 40 are mostly control characters, and the main reason for excluding them is to avoid the serious, system-wide consequences that a lost byte could cause if it were misread as a control character;

FF is excluded for compatibility with GB2312, which does not use that value. 7F is the DEL character, which deletes the character before it, so if the first byte of a pair were lost in transmission the consequences would be serious; xx7F therefore has to be excluded — something every encoding scheme has to watch out for.

GB18030

As computing developed, even GBK's twenty-odd thousand characters proved insufficient, so in 2000 China formulated a new standard, GB18030, to supersede GBK. GB18030 is a mandatory standard: software sold in mainland China is now required to support it.

GB18030 is in fact aligned with the Unicode standard: it covers the entire Unicode character set and can be regarded as an implementation (transformation format) of Unicode.

Since UTF already exists, why create yet another Unicode implementation?

Mainly because UTF-8 and UCS-2 are not compatible with GB2312: a direct switch would turn existing text into garbage. GB18030, by contrast, is compatible with the GB series — it is a superset of GBK and GB2312 — so software originally written for GB2312 (GBK) can upgrade to an internationalized, Unicode-covering encoding simply by moving to GB18030.

Although GB18030 is also an extension of GB2312, it extends differently from GBK. GBK mainly exploits code space left undefined by GB2312, whereas GB18030 uses a variable-length encoding: the single-byte range is compatible with ASCII, the double-byte range is compatible with GBK, and the four-byte range maps to all remaining Unicode code points.

The key trick is that the second byte uses the range 0x30~0x39, unused by GBK, to signal that the sequence is four bytes long.

bytes         coding space                                                                  code points
single byte   0x00 ~ 0x7F                                                                   128
double byte   1st byte 0x81 ~ 0xFE; 2nd byte 0x40 ~ 0xFE (excluding 0x7F)                   23,940
four bytes    1st 0x81 ~ 0xFE; 2nd 0x30 ~ 0x39; 3rd 0x81 ~ 0xFE; 4th 0x30 ~ 0x39            1,587,600

* A single byte whose value is from 0 to 0x7F.

* Double byte, the value of the first byte is from 0x81 to 0xFE, and the value of the second byte is from 0x40 to 0xFE (excluding 0x7F).

* Four bytes, the value of the first byte is from 0x81 to 0xFE, the value of the second byte is from 0x30 to 0x39, the value of the third byte is from 0x81 to 0xFE, and the value of the fourth byte is from 0x30 to 0x39.
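Given those ranges, a decoder can tell how long a GB18030 sequence is just by inspecting the lead bytes. Below is a minimal sketch (the helper name gb18030_seq_len is made up for illustration, and full validation is omitted):

#include <stddef.h>

/* Returns the length in bytes (1, 2 or 4) of the GB18030 sequence starting
 * at p, or 0 if the lead bytes do not match any of the ranges above.
 * n is the number of bytes available at p. */
static size_t gb18030_seq_len(const unsigned char *p, size_t n)
{
    if (n >= 1 && p[0] <= 0x7F)                       /* ASCII-compatible single byte */
        return 1;
    if (n >= 2 && p[0] >= 0x81 && p[0] <= 0xFE) {
        if (p[1] >= 0x30 && p[1] <= 0x39) {           /* 2nd byte 0x30-0x39 marks the 4-byte form */
            if (n >= 4 && p[2] >= 0x81 && p[2] <= 0xFE && p[3] >= 0x30 && p[3] <= 0x39)
                return 4;
            return 0;
        }
        if (p[1] >= 0x40 && p[1] <= 0xFE && p[1] != 0x7F)
            return 2;                                 /* GBK-compatible double byte */
    }
    return 0;                                         /* invalid or truncated sequence */
}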

UNICODE

background introduction

Before Unicode, countries created a large number of local encoding standards, single-byte and double-byte (GB2312, Shift JIS, Big5, ISO 8859 and so on), all mutually incompatible. In 1987, companies such as Apple, Sun and Microsoft began discussing a unified encoding standard that would cover every character in the world, and they went on to form the Unicode Consortium. A great deal of research was done in this period; the core questions were:

  • How many characters are there in the world and how many bytes are needed to store them?

The working group surveyed newspapers and other publications around the world at the time and concluded that two bytes were enough to cover all practically meaningful characters (counting only characters in current use, not ancient or obsolete scripts).

  • Fixed-length encoding or variable-length encoding?

One option was variable-length encoding — one byte for ASCII characters and two bytes for everything else, similar to GBK; the other was fixed-length encoding — two bytes for every character, ASCII or not.

The choice came down to the two dimensions of time and space: encode/decode efficiency versus storage size. The conclusion was to use fixed-length double-byte encoding. The extra space cost of fixed length has little impact on overall transmission and storage, while fixed-length encoding is noticeably more efficient to process than variable-length encoding, so early Unicode adopted the fixed-length form.

  • Are there many similar ideograms in China, Japan, and Korea that can be unified?

Since China, Japan and Korea share a huge number of similar ideographs, unifying them would greatly reduce the number of characters that need to be included.

The initial collection of Chinese characters therefore followed two basic principles: the ideograph-identification principle and the source-separation principle.

The ideograph-identification principle means encoding "the character, not the glyph": different glyphs (variant forms) of the same character are merged. The character "室", for example, has its first stroke written slightly differently in China, Japan and Korea, but it is the same character, so it gets a single code and the different writing styles are distinguished by fonts.

The source-separation principle means that if variant glyphs of the same character were already encoded separately in a source standard, they stay separate. For example, GBK already included the three variants "户", "戶" and "戸", so Unicode also needs to keep three separate characters; merging them would cause trouble in use.

For example, what would happen to the following sentence if the variants were not kept separate?

Original sentence: There are three ways to write the character for "household": "户", "戶" and "戸".

After merging: there are three ways to write the character for "household": "户", "户" and "户" — the sentence no longer makes any sense.

Introduction to Unicode

Unicode (also known as the Universal Code) is a character encoding system designed according to the modern encoding model, covering the abstract character set, the numbering, the logical encoding form and the encoding schemes. It was created to overcome the limitations of traditional character encodings: with Unicode there are no encoding conflicts between languages, and text from any country can be displayed on the same screen.

UTF-n (Unicode Transformation Format, where n is the number of bits in one code unit) is the encoding-implementation part of the Unicode system: UTF-8, UTF-16 and UTF-32 are what convert the character numbers into actual binary. Besides the UTF family, Unicode implementations also include UCS-2/UCS-4, GB18030 and others. Many people today think of Unicode as nothing more than the character numbering, which is not accurate.

Unicode can accommodate the characters and symbols of every country. Its numbering range is 0-0x10FFFF, i.e. 1,114,112 code points, divided into 17 planes for ease of management. 238,605 code points are currently assigned, spread across planes 0, 1, 2, 14, 15 and 16. Plane 0 is the Basic Multilingual Plane (BMP); it covers essentially all the characters in common use today, so the characters we normally use live in the BMP, which spans 65,536 code points. The other planes are collectively called supplementary planes; the concept of planes is covered in detail in the UTF-16 section.

Relationship with UCS


Any discussion of Unicode has to mention UCS (Universal Multiple-Octet Coded Character Set), international standard ISO/IEC 10646, a unified character set designed by a joint working group of the two international standards bodies ISO and IEC. Its goal, like the Unicode Consortium's, was a universal character encoding.

As early as 1984, ISO and IEC had set up a joint working group to design a new universal character set, but neither project knew about the other until the Unicode Consortium released its draft in 1988 (the UCS draft came out in 1989). Only then did they realize they were doing the same thing; there was no point in having two standards, so they began considering a merger.

UCS was originally designed with a 31-bit code space (implemented by the UCS-4 encoding), room for about 2.1 billion (2^31) characters, while Unicode had a 16-bit space (implemented by UTF-16). Unicode was therefore initially intended to be a proper subset of UCS: every Unicode character exists in UCS with the same code point, but UCS characters numbered above 65,536 do not necessarily exist in Unicode.

However, since neither side was willing to simply yield to the other, the two parties compromised so they could stay consistent and develop together: the code point of any given character must be the same in both standards. (Had the Unicode Consortium known about UCS from the start, perhaps there would never have been a separate Unicode at all.) The merger was not accomplished overnight but over several iterations: ISO/IEC and Unicode released their first mutually compatible versions in 1993, and by the time Unicode 2.0 was released in 1996, the Unicode character set and the UCS character set (ISO/IEC 10646-1) were essentially aligned. Along the way, Unicode introduced the UTF-32 encoding to line up with UCS's four-byte form, and UCS introduced UCS-2 to line up with Unicode's two-byte form.

So today we can treat UCS and Unicode as essentially the same thing. For example, Java uses UTF-16 internally while the Windows operating system uses UCS-2, and both follow the same Unicode standard.

→ Why use a 2-byte encoding here rather than 4 bytes? Let's leave that as a cliffhanger; it is explained in detail later.

UTF-16 (Java internal encoding)

UTF stands for Unicode Transformation Format, i.e. a way of turning Unicode into a concrete format. UTF-16 is one of Unicode's encoding implementations; the 16 is the number of bits in one code unit, i.e. two bytes (UTF-32 likewise means 32-bit, four-byte code units).

In the original Unicode design, UTF-16 was a fixed-length double-byte encoding in which the stored binary is simply the character number — that is, the layer-2 CCS and the layer-3 CEF coincide. The Chinese character "万", for example, has code point U+4E07, and its code-unit sequence is the literal binary "0100 1110 0000 0111". The advantage is efficiency: no flag bits need to be examined. The disadvantage is that it is not compatible with ASCII, so existing ASCII-encoded text shows up as garbage.

Later, however, the Unicode Consortium found 16 bits of code space too small, while ISO/IEC felt UCS's 32-bit space was far too large — there are nowhere near billions of characters, so most of it was wasted. The Consortium and the ISO/IEC working group agreed on a common code space of 0000 ~ 10FFFF (UCS guarantees never to allocate a code point above 10FFFF), and on keeping the two standards synchronized: when one adds characters, the other is notified and follows.

So Unicode extended its code space to 21 bits on top of UTF-16, while UCS provided the double-byte UCS-2 encoding.

→ UTF-16 code units are two bytes, which caps out at a little over 65,000 code points. How can it support 0x10FFFF (over a million) code points?

In essence, more bytes are added to represent more characters. But instead of switching to fixed 4 bytes like UCS-4, UTF-16 became variable length — in a way different from UTF-8 — by means of surrogate pairs. Most commonly used characters are represented by one code unit (2 bytes); other, extended characters are represented by two code units (4 bytes).

Surrogate pair

UTF-16, UTF-8 and the GB series are all variable-length encodings, but for different original reasons. GBK is variable length in order to stay compatible with ASCII, whereas UTF-16 never aimed at ASCII compatibility; its variable length is simply a natural growth scheme that saves storage, growing to 4 bytes only when 2 are not enough.

The question then is: how do we know whether 4 stored bytes represent one character or two? When a program encounters the byte sequence 01001110 00101101 01010110 11111101, should it read one character or two?

This calls for a leading marker. GB2312, for example, looks at whether the high bit of the first byte is 1 to decide between single-byte and double-byte. In UTF-16, however, the high bit 1 is already used for ordinary encoding. No matter: if one leading bit is taken, use a combination of several leading bits instead.

UTF-16 solves this with surrogate pairs: the high (leading) surrogate — the first two bytes — lies in the range D800-DBFF (these are called surrogate code points), and the low (trailing) surrogate — the last two bytes — lies in DC00-DFFF; together the pair represents one character in four bytes.


The choice of those leading 6 bits is deliberate. ISO requires the numbering range 0 ~ 10FFFF; once the BMP is subtracted, the supplementary code points fit in 20 bits. With two code units, each unit carries 10 of those bits; a code unit is 16 bits, so after 10 bits of payload the remaining 6 bits serve as the leading marker.

When a UTF-16 character fits in one code unit, the Unicode character number and the code unit are identical; with two code units, the character number and the code-unit sequence must be converted into each other. The conversion formulas between the code units and the Unicode number are:

From character number to code-unit sequence (CH = high/leading surrogate, CL = low/trailing surrogate), for a code point U in the range 10000 ~ 10FFFF:

  U' = U - 0x10000              (a 20-bit value)
  CH = 0xD800 + (U' >> 10)      (the top 10 bits)
  CL = 0xDC00 + (U' & 0x3FF)    (the bottom 10 bits)

From code-unit sequence back to character number:

  U = 0x10000 + ((CH - 0xD800) << 10) + (CL - 0xDC00)
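Here is a short C sketch of those formulas, encoding a supplementary-plane code point into a surrogate pair and back; U+1F600 is just a sample code point chosen for illustration.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t u = 0x1F600;                 /* a supplementary-plane code point */

    /* Character number -> surrogate pair */
    uint32_t v  = u - 0x10000;            /* 20-bit value */
    uint16_t ch = 0xD800 + (v >> 10);     /* high (leading) surrogate: top 10 bits */
    uint16_t cl = 0xDC00 + (v & 0x3FF);   /* low (trailing) surrogate: bottom 10 bits */
    printf("U+%04X -> %04X %04X\n", (unsigned)u, (unsigned)ch, (unsigned)cl);

    /* Surrogate pair -> character number */
    uint32_t back = 0x10000 + ((uint32_t)(ch - 0xD800) << 10) + (cl - 0xDC00);
    printf("%04X %04X -> U+%04X\n", (unsigned)ch, (unsigned)cl, (unsigned)back);
    return 0;
}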

plane space

The encoding space 0000 ~ 10FFFF is cut into 17 planes — 17 blocks of 65,536 code points each. The first plane is the Basic Multilingual Plane (BMP); it holds the most commonly used characters in the world today and is encoded with a fixed two bytes. The remaining characters live in the supplementary planes and are encoded with two code units, i.e. a fixed four bytes. The purpose of each plane is as follows:

Plane 0: Basic Multilingual Plane (BMP); plane 1: Supplementary Multilingual Plane (SMP); plane 2: Supplementary Ideographic Plane (SIP); plane 3: Tertiary Ideographic Plane (TIP); planes 4~13: unassigned; plane 14: Supplementary Special-purpose Plane (SSP); planes 15~16: Private Use Areas.

Supplementary-plane numbers are represented by the 4 bytes of a surrogate pair. After stripping the surrogate prefixes, 20 effective bits remain. These 20 bits cover the 16 supplementary planes: 4 bits (from the high half) select the plane, and the remaining 16 bits index the characters within that plane — 2^16 = 65,536 of them (two bytes' worth).


UTF-16 can be regarded as a superset of UCS-2. Before the supplementary planes existed, UTF-16 and UCS-2 meant the same thing; once supplementary-plane characters were introduced, the encoding that supports them is what we call UTF-16.

byte order

As the name implies, byte order is the order of bytes. For single-byte encodings, one character is one byte, so there is no byte-order problem; for a multi-byte-unit encoding like UTF-16 there is. Byte order is a property of the operating system and underlying hardware, and it affects not only encodings like UTF-16 but every multi-byte data type — short, int, long and so on.

Take an example: storing the integer 305419896, which is 0x12345678 in hexadecimal. Some prefer to store it from left to right as written; others argue the high-order byte should go to the high address and the low-order byte to the low address, i.e. store it from right to left. Hence the following two layouts.

Big-endian layout (left to right as written): 12 34 56 78 — the high-order byte sits at the low address. Little-endian layout: 78 56 34 12 — the low-order byte sits at the low address.

Neither method is better or worse; they simply reflect different habits of mind that led to different designs. (Blame the Arabic numeral convention — why must the high digits be on the left?) And so we have the famous endianness dispute.

Byte order thus splits into big-endian and little-endian camps: Windows, FreeBSD and Linux on x86 are little-endian, for example, while classic PowerPC Macs were big-endian. Technically there is little to choose between them.

Little-endian means the low-order bytes (the "little end", the tail of the number) are stored at the low memory address, and the high-order bytes (the "big end", the head of the number) at the high address.

Big-endian means the high-order bytes (the "big end", the head of the number) are stored at the low memory address, and the low-order bytes (the "little end", the tail of the number) at the high address.
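A small C sketch that checks which byte order the machine it runs on uses, by looking at the byte stored at the lowest address of the integer 0x12345678:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t value = 0x12345678;
    /* Look at the byte stored at the lowest address of value. */
    unsigned char first = *(unsigned char *)&value;

    if (first == 0x78)
        printf("little-endian: low-order byte 0x78 sits at the low address\n");
    else if (first == 0x12)
        printf("big-endian: high-order byte 0x12 sits at the low address\n");
    return 0;
}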

UTF-8

Introduction & Rules

Both Unicode and UCS initially adopted fixed-length multi-byte encodings, which are incompatible with existing ASCII files and software, and that made the new standards hard to promote. Hence the birth of the ASCII-compatible UTF-8.

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode — the CEF layer of the modern encoding model — and part of the Unicode standard. It uses one to four bytes to encode every valid code point in the Unicode character set. UTF-8 was designed for backward compatibility with ASCII: the first 128 Unicode characters (which correspond one-to-one with ASCII) are encoded as single bytes with the same binary values as ASCII, so existing software that handles ASCII text keeps working with little or no modification. It has therefore gradually become the preferred encoding for email, web pages and other stored or transmitted text.

-- Wikipedia

Because UTF-8 must be compatible with ASCII, it too relies on prefix codes. The prefix rules are as follows:

  • If the first byte starts with 0, it is a single-byte encoding (that is, a single single-byte code unit);
  • If the first byte starts with 110, it is a double-byte code (that is, a double-code sequence composed of two single-byte code elements);
  • If the first byte starts with 1110, it is a three-byte code (that is, a three-code sequence composed of three single-byte code elements), and so on.

code point range        byte 1      byte 2      byte 3      byte 4
U+0000  ~ U+007F        0xxxxxxx
U+0080  ~ U+07FF        110xxxxx    10xxxxxx
U+0800  ~ U+FFFF        1110xxxx    10xxxxxx    10xxxxxx
U+10000 ~ U+10FFFF      11110xxx    10xxxxxx    10xxxxxx    10xxxxxx

In theory the UTF-8 scheme can extend beyond 4 bytes (the original design, as the table in the code below shows, went up to 6), but since the Unicode Consortium caps code points at 10FFFF, the UTF-8 rules are likewise limited to 4 bytes.
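Before turning to the decoder below, here is a minimal encoder sketch that follows the prefix rules above, restricted to the 1- to 4-byte forms allowed today (the function name utf8_encode is made up for illustration):

#include <stdint.h>

/* Encode the code point cp into out (at least 4 bytes) following the prefix
 * rules above. Returns the number of bytes written, or 0 if cp is outside
 * the Unicode range 0..0x10FFFF. */
static int utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;
}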

program algorithm

The structure of the algorithm is hard to convey in words, so let's look directly at the parsing code written by UTF-8's originators. This is the decoding routine that Ken Thompson and Rob Pike wrote in one evening; it is short and concise. For ease of reading, I have added commentary.


typedef
struct
{
  int   cmask; // prefix-code mask
  int   cval;  // prefix code
  int   shift; // number of bit positions to shift
  long  lmask; // mask for the Unicode value
  long  lval;  // lower bound of the Unicode value for this length
} Tab;

static
Tab  tab[] =
{
  0x80, 0x00, 0*6, 0x7F,       0,         /* 1 byte sequence */
  0xE0, 0xC0, 1*6, 0x7FF,      0x80,      /* 2 byte sequence */
  0xF0, 0xE0, 2*6, 0xFFFF,     0x800,     /* 3 byte sequence */
  0xF8, 0xF0, 3*6, 0x1FFFFF,   0x10000,   /* 4 byte sequence */
  0xFC, 0xF8, 4*6, 0x3FFFFFF,  0x200000,  /* 5 byte sequence */
  0xFE, 0xFC, 5*6, 0x7FFFFFFF, 0x4000000, /* 6 byte sequence */
  0, /* end of table */
};

/**
 * Convert one multi-byte sequence into a wide character.
 *
 * @param p receives the decoded Unicode value
 * @param s the UTF-8 byte sequence to parse
 * @param n the number of bytes available
 * @return the number of bytes consumed, or -1 on error
 */
int mbtowc(wchar_t *p, char *s, size_t n)
{
  long l;  int c0, c, nc;  Tab *t;
  if(s == 0) return 0;
  nc = 0;
  // length check
  if(n <= nc) return -1;
  // keep a copy of the first byte; its prefix bits are tested against each table entry
  c0 = *s & 0xff;
  // l accumulates the Unicode result
  l = c0;
  /* walk the table: try the 1-byte entry, then the 2-byte entry, and so on */
  for(t=tab; t->cmask; t++) {
    // byte count +1; it stays in step with the table entry (nc=1: 1-byte entry, nc=2: 2-byte entry, ...)
    nc++;
    /* if the prefix code matches this entry, compute the final Unicode value and return */
    if((c0 & t->cmask) == t->cval) {
      // AND with the value mask strips the high prefix bits, leaving the Unicode value
      l &= t->lmask;
      // reject overlong sequences: the value must reach this length's lower bound
      if(l < t->lval) return -1;
      // store the result and return the number of bytes consumed
      *p = l;
      return nc;
    }
    // length check: another byte must be available
    if(n <= nc) return -1;
    // the prefix did not match this entry, so read the next (continuation) byte
    s++;
    // strip the high-order "10" prefix that every continuation byte carries
    // e.g. s=10101111, 0x80=10000000, XOR gives 00101111
    c = (*s ^ 0x80) & 0xFF;
    // if the top two bits were not "10", the sequence is malformed
    if(c & 0xC0) return -1;
    // per the UTF-8 rules only the low 6 bits of c are payload, so shift l and merge them in
    l = (l<<6) | c;
  }
  // no table entry matched
  return -1;
}

fault tolerance

From the program above we can see that parsing proceeds byte by byte. If some bytes are corrupted or lost in transmission, or a byte in the middle does not match the rules, does that affect parsing of the entire text?

First consider the fault tolerance of other encodings. With single-byte ASCII, losing one byte loses exactly one character and the rest of the text is unaffected: if the second byte of "Hello world" is lost, the text becomes "Hllo world" — just one missing "e".

Now look at a multi-byte encoding such as GB2312: lose the second byte and everything after it is scrambled. This is the worst case, and most multi-byte encodings have the same problem — one error and the whole file may have to be retransmitted.


Finally, let's see how UTF-8 avoids letting "one mouse dropping spoil the whole pot of porridge". The first byte of a UTF-8 sequence announces how many bytes follow: a high bit of 0 means a single byte, 110 means two bytes in total, 1110 means three, and so on, while every continuation byte starts with 10. This prefix scheme is very robust: even if individual bytes are lost, inserted or altered, the damage does not cascade into all subsequent characters being garbled.
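That self-synchronizing property is easy to exploit: after an error, a decoder only needs to skip forward to the next byte that does not start with "10" to land back on a character boundary, so only the damaged character is lost. A tiny sketch of that recovery step (the helper names are made up for illustration):

#include <stddef.h>

/* Returns 1 if b is a UTF-8 continuation byte (binary 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* After a decoding error at position i, skip to the next possible
 * character boundary so only the damaged character is lost. */
static size_t utf8_resync(const unsigned char *s, size_t i, size_t n)
{
    i++;                                    /* step past the bad byte */
    while (i < n && is_continuation(s[i]))  /* skip 10xxxxxx bytes */
        i++;
    return i;                               /* index of the next lead byte (or n) */
}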


Summary

Something as small as character encoding turns out, on closer inspection, to have such a rich history of development. Imagine: if computers had stayed mainframe-sized and personal computers had never exploded in popularity, these character encodings would not exist; if ASCII had been designed as a multi-byte encoding from the start, there might never have been a Unicode.

This is a very typical architecture design problem. Is a good architecture designed or evolved?

  • Some say an architecture with no design has no soul and dies quickly along the way.
  • Some say great architecture is all design. That is a kind of perfectionism: design 50 or 100 years ahead and the company may be long gone before the design pays off — plenty of products and architectures never become popular at all.

A good architecture depends on both design and evolution. As the saying goes, it is three parts design and seven parts evolution; we must learn to be both pragmatic and forward-looking — and, at the very least, to survive first.


Origin blog.csdn.net/qq_43842093/article/details/131605971