Getting started with programming | Computer knowledge every programmer must know: character encoding

1. ASCII code

We know that all information inside a computer is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. A group of eight bits is called a byte.

In other words, a byte can represent 256 different states, from 00000000 to 11111111, and each state can correspond to one symbol, for 256 symbols in total.

In the 1960s, the United States formulated a character encoding that standardized the relationship between English characters and binary values. This is the ASCII code, and it is still in use today.

The ASCII code specifies a total of 128 characters. For example, the space character is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001).

These 128 symbols (including 33 non-printing control characters, codes 0 to 31 and 127) occupy only the low 7 bits of a byte; the highest bit is always set to 0.
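As a quick check of the numbers above, here is a minimal Python sketch. It only uses the built-in ord and format functions; the variable names are illustrative.

```python
# Look up the ASCII codes quoted above and show them as 8-bit binary.
for ch in (" ", "A"):
    code = ord(ch)  # numeric code of the character
    print(repr(ch), code, format(code, "08b"))

# Every ASCII code is below 128, so it fits in 7 bits and the
# highest bit of the byte is always 0.
ascii_codes = list(range(128))
high_bits = {code >> 7 for code in ascii_codes}
print(high_bits)
```

The set of high bits contains only 0, which is what leaves the eighth bit free for the extensions described in the next section.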

2. Non-ASCII encoding

For English, 128 symbols are enough, but for other languages they are not. For example, French letters with a diacritical mark (accent) above them cannot be represented in ASCII.

As a result, some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, the code for é in a French encoding is 130 (binary 10000010).

In this way, the coding system used in these European countries can represent up to 256 symbols.

However, this created a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the same code stands for different letters.

For example, 130 represents é in the French encoding, but it represents the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding.

In all of these encodings, however, the symbols represented by 0-127 are the same; only the range 128-255 differs.
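The effect can be reproduced with a small Python sketch. The three code pages used here (cp437, cp862 and cp866) are assumptions chosen for illustration; the text above does not say which French, Hebrew and Russian encodings it has in mind.

```python
raw = bytes([130])  # the single byte 10000010

# Decode the same byte under three legacy code pages (illustrative choices).
decoded = {}
for codec in ("cp437", "cp862", "cp866"):
    decoded[codec] = raw.decode(codec)
    print(codec, decoded[codec])

# Bytes 0-127 decode to the same text under all three code pages.
low = bytes(range(128))
same_low_half = (
    low.decode("cp437") == low.decode("cp862") == low.decode("cp866")
)
print(same_low_half)
```

One byte value, three different letters: this is why a file in one of these encodings cannot be read correctly without knowing which code page produced it.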

The scripts of Asian countries use even more symbols; there are as many as 100,000 Chinese characters. One byte can represent only 256 symbols, which is far from enough, so multiple bytes must be used to represent one symbol.

For example, the common encoding method for simplified Chinese is GB2312, which uses two bytes to represent a Chinese character, so in theory, it can represent up to 256 x 256 = 65536 symbols.
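A minimal Python sketch of the two-byte idea, using the character 严 (U+4E25) as an arbitrary sample. It relies on Python's built-in gb2312 codec.

```python
char = "\u4e25"  # the Chinese character 严, chosen as a sample

# Encode with the built-in gb2312 codec and inspect the result.
gb_bytes = char.encode("gb2312")
print(gb_bytes.hex(), len(gb_bytes))

# Theoretical upper bound for any two-byte encoding.
upper_bound = 256 * 256
print(upper_bound)

# Decoding with the same codec restores the original character.
round_trip = gb_bytes.decode("gb2312")
```

The upper bound is only theoretical: a real encoding such as GB2312 assigns characters to just a part of those 65536 combinations.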

Chinese encoding deserves an article of its own and is not covered in this note. The only point made here is that, although they also use multiple bytes to represent a symbol, the GB family of Chinese character encodings has nothing to do with the Unicode and UTF-8 described below.

3. Unicode

The same binary number can be interpreted as different symbols.

Therefore, to open a text file you must know its encoding; if you decode it with the wrong encoding, garbled characters appear. Why do emails often appear garbled?

Because the sender and the recipient use different encodings.
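The garbling can be reproduced in a few lines of Python. In this sketch the "sender" is assumed to write UTF-8 and the "recipient" to read Latin-1; both choices are illustrative.

```python
text = "\u4e25"  # the Chinese character 严

sent = text.encode("utf-8")       # sender writes UTF-8 bytes
garbled = sent.decode("latin-1")  # recipient wrongly assumes Latin-1
correct = sent.decode("utf-8")    # recipient uses the right encoding

print(sent.hex())
print(garbled)
print(correct)
```

The bytes never change in transit; only the interpretation differs, which is why the fix is to agree on the encoding rather than to repair the data.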

Now imagine a single encoding that includes every symbol in the world and gives each symbol a unique code. The garbling problem would disappear.

That encoding is Unicode. As its name implies, it is one encoding for all symbols.

Unicode is of course a very large set; its code space currently has room for more than 1 million symbols, and each symbol has its own code.

For example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严 (yán, "strict"). For the full correspondence table, you can consult unicode.org or a dedicated Chinese character correspondence table.
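These code points can be looked up from Python, whose strings are sequences of Unicode code points. The sketch below uses only the standard sys and unicodedata modules; the names it prints come from the Unicode database bundled with the interpreter.

```python
import sys
import unicodedata

# Print each code point, its character, and its official Unicode name.
names = {}
for cp in (0x0639, 0x0041, 0x4E25):
    ch = chr(cp)
    names[cp] = unicodedata.name(ch)
    print(f"U+{cp:04X}", ch, names[cp])

# Size of the Unicode code space: code points 0 through sys.maxunicode.
code_space = sys.maxunicode + 1
print(code_space)
```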

4. UTF-8

The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet.

Other implementation methods include UTF-16 (characters are represented by two bytes or four bytes) and UTF-32 (characters are represented by four bytes), but they are basically not used on the Internet.
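To see the byte counts for UTF-16 and UTF-32, here is a small Python sketch. The three sample characters are arbitrary, and the "-be" codec variants are used so that no byte order mark is added to the output.

```python
# Sample characters: A, the Chinese character 严, and an emoji outside
# the 16-bit range.
samples = ["A", "\u4e25", "\U0001F600"]

sizes = {}
for ch in samples:
    utf16_len = len(ch.encode("utf-16-be"))
    utf32_len = len(ch.encode("utf-32-be"))
    sizes[ch] = (utf16_len, utf32_len)
    print(f"U+{ord(ch):04X}", utf16_len, utf32_len)
```

UTF-16 needs four bytes only for code points above U+FFFF, while UTF-32 always uses four.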

One of the biggest features of UTF-8 is that it is a variable-length encoding method. It can use 1 to 4 bytes to represent a symbol, and the byte length varies according to different symbols.
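The variable length is easy to observe in Python. The four sample characters below are arbitrary picks, one for each possible length.

```python
# One sample per UTF-8 length: A, é, 严, and an emoji.
samples = ["A", "\u00e9", "\u4e25", "\U0001F600"]

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X}", encoded.hex(), len(encoded))
```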

The UTF-8 encoding rules are very simple. There are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the following 7 bits are the Unicode code of this symbol. Therefore, for English letters, UTF-8 encoding and ASCII code are the same.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1, bit n + 1 is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits are filled with the Unicode code of this symbol.
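The two rules can be sketched as a hand-written encoder. This is a minimal illustration, not a complete implementation: the function name utf8_encode is hypothetical, the byte-count thresholds are the standard UTF-8 ranges (not stated in the text above), and the sketch does not reject invalid inputs such as surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by applying the two rules by hand."""
    # Rule 1: single byte, leading bit 0, remaining 7 bits are the code.
    if code_point < 0x80:
        return bytes([code_point])

    # Rule 2: pick the number of bytes n from the size of the code.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    else:
        n = 4

    # Fill the following bytes from the right: prefix 10 plus 6 code bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # First byte: n one-bits, then a zero, then the remaining code bits.
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    lead = lead_prefix | code_point
    return bytes([lead] + tail[::-1])


print(utf8_encode(0x4E25).hex())
```

For U+4E25 the result can be compared with Python's built-in encoder, chr(0x4E25).encode("utf-8"), which produces the same three bytes.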

5. Little endian and Big endian

Take the Chinese character 严 (yán) as an example. Its Unicode code is 4E25, which needs two bytes to store: one byte is 4E and the other is 25.

If 4E is stored first and 25 second, that is big-endian order; if 25 is stored first and 4E second, that is little-endian order.

In other words, putting the first (most significant) byte in front is "big endian", and putting the second (least significant) byte in front is "little endian".
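Both byte orders can be produced in Python with int.to_bytes. A minimal sketch with illustrative variable names:

```python
code = 0x4E25  # Unicode code of the Chinese character 严

big = code.to_bytes(2, "big")        # most significant byte first
little = code.to_bytes(2, "little")  # least significant byte first
print(big.hex(), little.hex())

# The UTF-16 codecs with an explicit byte order give the same bytes.
utf16_be = chr(code).encode("utf-16-be")
utf16_le = chr(code).encode("utf-16-le")
```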

This naturally raises a question: how does the computer know which byte order a given file uses?

The Unicode specification defines a character that can be placed at the very start of a file to indicate the byte order. This character is named "zero width no-break space" and is represented by FEFF. That is exactly two bytes, and FF is 1 larger than FE.

If the first two bytes of a text file are FE FF, the file is big-endian; if the first two bytes are FF FE, the file is little-endian.
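The check described above can be sketched in Python using the constants from the standard codecs module. The function name detect_utf16_order is hypothetical, and the sketch assumes the data is UTF-16: it does not distinguish a UTF-32 little-endian file, whose first two bytes are also FF FE.

```python
import codecs


def detect_utf16_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    return "unknown"  # no byte order mark present


# The character 严 (4E25) preceded by each mark.
print(detect_utf16_order(b"\xfe\xff\x4e\x25"))
print(detect_utf16_order(b"\xff\xfe\x25\x4e"))
print(detect_utf16_order(b"\x4e\x25"))
```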



Origin blog.csdn.net/weixin_45713725/article/details/115264130