Character set and encoding

Character set: a collection of symbols and control characters, together with the number (code) assigned to each of them

Encoding: the rules a computer follows to store the characters of a character set as bytes and to read them back

For example, the ASCII character set has 128 characters, in which 'A' corresponds to decimal 65. The ASCII encoding represents each character with one byte whose highest bit is 0. Looking a byte up in the character set tells you what that byte means.
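The split between the number assigned by the character set and the byte produced by the encoding can be seen in a minimal Python sketch. The variable names are only illustrative.

```python
# Character set: 'A' is assigned the number 65.
code_point = ord("A")

# Encoding: ASCII stores that number in one byte whose highest bit is 0.
stored = "A".encode("ascii")

print(code_point)                 # 65
print(stored)                     # b'A'
print(format(stored[0], "08b"))   # 01000001
```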

ASCII: contains 128 characters

ISO-8859-1: uses all 256 values of a byte and is backward compatible with ASCII, that is, its first 128 codes (0~127) are the same as ASCII

ANSI: a collective name for extended ASCII encodings, which differ from region to region. On a Simplified Chinese system it is GBK, on a Traditional Chinese system Big5, and on a Japanese system Shift_JIS
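Because these regional encodings reuse the same byte values for different characters, the same bytes mean different things depending on which one is assumed. The sketch below uses the Chinese character 严 (Unicode 4E25) and Python's built-in `gbk` and `iso-8859-1` codecs; the variable names are illustrative.

```python
raw = "严".encode("gbk")                # two bytes under the Simplified Chinese encoding

as_gbk = raw.decode("gbk")              # read back with the matching encoding
as_latin1 = raw.decode("iso-8859-1")    # read with the wrong one: two unrelated letters

print(raw.hex())
print(as_gbk)
print(as_latin1)
```

The bytes never change; only the table used to interpret them does, which is why text opened with the wrong encoding shows up as garbage.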

Unicode

Today the most important character set is Unicode (compatible with ASCII), which contains almost all the characters in the world. For example, the Unicode code of the Chinese character 严 ("strict") is hexadecimal 4E25. What is still missing is an encoding that implements this character set.

Designing that encoding raises a problem: in such a large character set, some characters fit in one byte while others need several. Suppose the last character in the set needs 4 bytes. Should we then require every character to be stored in 4 bytes, padding the unused leading bits with 0?

That would be a poor choice (more on this later), because it wastes a great deal of storage space. The characters we use most often sit at the front of the character set and can be represented with fewer bytes, so storing them in their short form saves all of those padding zeros.
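To put a number on that waste, here is a rough sketch that uses Python's `utf-32-be` codec as a stand-in for the hypothetical "4 bytes for every character" scheme (the article does not name UTF-32; it is an assumption made for illustration) and compares it with a variable-length encoding.

```python
text = "hello"

fixed = text.encode("utf-32-be")    # every character padded out to 4 bytes
variable = text.encode("utf-8")     # these ASCII characters need 1 byte each

print(len(fixed), fixed.hex())      # 20 00000068000000650000006c0000006c0000006f
print(len(variable), variable.hex())  # 5 68656c6c6f
```

For plain English text, three out of every four bytes in the fixed-width form are zero padding.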

UTF-8

The most commonly used Unicode encoding today is UTF-8. Its main feature is that it is a variable-length encoding: it uses 1~4 bytes to represent a symbol.

UTF-8 has two encoding rules:

1. For a single-byte symbol, the first bit is 0 and the remaining 7 bits hold the Unicode code of the symbol

2. For an n-byte symbol (n>1), the first n bits of the first byte are 1 and bit n+1 is 0; every following byte starts with 10; all the remaining bits (marked x below) together hold the Unicode code of the symbol

Unicode Symbol Range | UTF-8 Encoding
(Hexadecimal) | (Binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of 严 is 4E25 (binary 100111000100101), which falls in the range 0800~FFFF, so it needs three bytes. Filling those bits into the x positions from right to left, and padding the leftover x with 0, gives the UTF-8 encoding "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
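The table can be turned into code almost line by line. The following is a minimal sketch, not a production encoder: the function name `utf8_encode` is made up for this example, and it does not reject the surrogate range D800~DFFF as a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the UTF-8 table."""
    if code_point <= 0x7F:            # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:          # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the Unicode range")


encoded = utf8_encode(0x4E25)
print(encoded.hex())                                  # e4b8a5
print(" ".join(format(b, "08b") for b in encoded))    # 11100100 10111000 10100101
print(encoded == chr(0x4E25).encode("utf-8"))         # True
```

The last line checks the hand-written version against Python's built-in UTF-8 codec.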

UCS-2

Another Unicode encoding is UCS-2, in which every character is stored in exactly two bytes, so it can only cover codes up to FFFF. This is what Notepad calls "Unicode". Notepad also lists "Unicode big endian" next to it; the two are in fact the same encoding with different byte storage orders.

little endian and big endian

These two names come from Gulliver's Travels. In the book, a civil war breaks out in Lilliput because the people cannot agree on whether an egg should be cracked open from the big end or from the little end.

Memory in a computer is addressed by number, and a contiguous block of memory is read starting from the lowest address.

For example, suppose the memory cells at addresses 10 and 11 are to hold the character 严 (Unicode 4E25) encoded in UCS-2.

Storing the high-order byte 4E at address 10 and the low-order byte 25 at address 11, that is, high-order byte first, is big-endian.

The opposite order, 25 at address 10 and 4E at address 11, is little-endian.
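The two byte orders can be reproduced in a short Python sketch. Python has no codec named "ucs-2", so `utf-16-be` is used here as a stand-in; for codes up to FFFF the two produce the same bytes.

```python
code = 0x4E25

big = code.to_bytes(2, "big")         # high-order byte 4E comes first
little = code.to_bytes(2, "little")   # low-order byte 25 comes first

print(big.hex(), little.hex())                   # 4e25 254e
print(big == chr(code).encode("utf-16-be"))      # True
print(little == chr(code).encode("utf-16-le"))   # True
```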

UTF-8 has only one storage order, so the computer always knows how to read it. UCS-2 has two, and without a hint the computer cannot tell which one a file uses. A flag is therefore required: a big-endian file starts with the two bytes FE FF, and a little-endian file starts with FF FE.
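As a small illustration of how the flag is used, the sketch below builds both file layouts by hand and lets Python's generic `utf-16` codec work out the byte order from the first two bytes. The names `big_file` and `little_file` are illustrative, and `utf-16` again stands in for UCS-2.

```python
import codecs

ch = chr(0x4E25)

# Flag (byte order mark) followed by the character in the matching byte order.
big_file = codecs.BOM_UTF16_BE + ch.encode("utf-16-be")
little_file = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")

print(big_file.hex())      # feff4e25
print(little_file.hex())   # fffe254e

# The generic codec reads the flag, picks the byte order, and drops the flag.
print(big_file.decode("utf-16") == little_file.decode("utf-16") == ch)   # True
```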
