Concept analysis: character set, character encoding, byte order, ASCII, GBK, Unicode, UTF-8, ANSI

Concept analysis (1): character set, character encoding, byte order

Character set (charset): as the name implies, a character set is a collection of characters. It defines which characters, out of the many in use (Simplified Chinese, Traditional Chinese, English, Arabic, Korean, Japanese, and so on), can be covered and represented. Note that although a character set usually comes with a default character encoding, strictly speaking a character set is only a set of characters; it says nothing about how those characters are encoded or in what byte order they are stored.
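As a rough illustration of "range of characters covered", the sketch below asks Python whether a character can be represented under a given codec. Python's codec names bundle a character set with its encoding, so this is only a stand-in for charset coverage; `can_encode` and the `samples` dictionary are hypothetical names introduced for the example.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# One sample character per script (illustrative choices, not from the article).
samples = {"English": "A", "Chinese": "中", "Korean": "한"}

for codec in ("ascii", "gbk", "utf-8"):
    covered = [name for name, ch in samples.items() if can_encode(ch, codec)]
    print(codec, covered)
# ascii ['English']
# gbk ['English', 'Chinese']
# utf-8 ['English', 'Chinese', 'Korean']
```

ASCII covers only the English letter, GBK adds the Chinese character but not the Korean one, and Unicode (here via UTF-8) covers all three.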

Character encoding (encoding): the rules for encoding characters, including the encoding length (single-byte, double-byte, four-byte, or variable-length) and the mapping between encoded values and the characters in the character set.
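To make the difference between a character and its encoded form concrete, here is a small Python sketch that encodes the same two characters under several encodings. The choice of sample characters and the `encodings` tuple are assumptions made for the example.

```python
encodings = ("utf-8", "gbk", "utf-32-be")

for ch in ("A", "中"):
    # ord() gives the Unicode code point, which is independent of any encoding.
    print(f"{ch!r} code point U+{ord(ch):04X}")
    for codec in encodings:
        data = ch.encode(codec)
        print(f"  {codec:<9} {len(data)} byte(s): {data.hex()}")
# 'A' code point U+0041
#   utf-8     1 byte(s): 41
#   gbk       1 byte(s): 41
#   utf-32-be 4 byte(s): 00000041
# '中' code point U+4E2D
#   utf-8     3 byte(s): e4b8ad
#   gbk       2 byte(s): d6d0
#   utf-32-be 4 byte(s): 00004e2d
```

The same character yields different byte sequences and different lengths depending on the encoding rule, which is exactly what the encoding (and not the character set) decides.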

Byte order (byte-order): big-endian or little-endian. It refers to the storage order of multi-byte data. If a character encoding uses multi-byte units, the order in which the bytes of each encoded unit are stored in memory matters. As a mnemonic, "endian" can be read as the tail end of multi-byte data, that is, the low-order (least significant) byte, and "big"/"little" as the storage address. Big-endian therefore stores the low-order byte at the highest address (so the high-order byte comes first, at the lowest address), while little-endian stores the low-order byte at the lowest address.
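The two layouts can be seen directly in Python, where the leftmost byte of a `bytes` object corresponds to the lowest address. This is a minimal sketch; the value `0x12345678` and the variable names are chosen only for illustration.

```python
value = 0x12345678  # low-order byte is 0x78, high-order byte is 0x12

big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex())     # 12345678  (high-order byte at the lowest address)
print(little.hex())  # 78563412  (low-order byte at the lowest address)

# The same applies to a multi-byte character encoding such as UTF-16:
print("中".encode("utf-16-be").hex())  # 4e2d
print("中".encode("utf-16-le").hex())  # 2d4e
```

Both byte sequences represent the same number (or the same character); only the order in which the bytes are laid out differs.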

Machines from DEC (Digital Equipment Corporation, later acquired by Compaq) and Intel's x86 platforms generally use little-endian.

Machines from IBM, Motorola (PowerPC), and Sun generally use big-endian. This is not universal, however: some CPUs, such as ARM, Alpha, and Motorola's PowerPC, can run in either little-endian or big-endian mode, and which one is in effect depends on the specific configuration; refer to the processor manual for details. For example, PowerPC supports little-endian byte order but is big-endian in its default configuration.

Generally speaking, most users' operating systems (such as Windows, FreeBSD, and Linux) run on little-endian hardware such as x86. A few, such as Mac OS on PowerPC, are big-endian; macOS on Intel or Apple silicon hardware is little-endian.
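Rather than relying on rules of thumb about vendors and operating systems, the byte order of the current machine can be checked at run time. The Python sketch below shows two ways of doing so; `host_byte_order` is a hypothetical helper name introduced for the example.

```python
import struct
import sys


def host_byte_order() -> str:
    """Detect the host byte order by inspecting how the integer 1 is laid out."""
    # "=I" packs an unsigned 32-bit integer in the machine's native byte order.
    first_byte = struct.pack("=I", 1)[0]
    # On a little-endian host the low-order byte (0x01) sits at the lowest address.
    return "little" if first_byte == 1 else "big"


print(sys.byteorder)      # the interpreter's own report: "little" or "big"
print(host_byte_order())  # the same answer, derived from the memory layout
```

On an x86 or typical ARM machine both lines print `little`; the output is left unannotated here because it depends on the host.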

So whether a system is little-endian or big-endian depends on the chip type and on how the operating system configures it. On Linux systems, you can search /usr/include/ (including subdirectories) for the string BYTE_ORDER (or _BYTE_ORDER, __BYTE_ORDER) to determine its value. This value is generally defined in endian.h or machine/endian.h, and sometimes in feature.h; the location may differ between operating systems.

The data storage order in a program written in C/C++ follows the CPU of the platform it is compiled for, whereas Java uses big-endian for its binary data formats (for example, DataOutputStream, and ByteBuffer by default) regardless of platform. The protocols of the TCP/IP family also transmit multi-byte header fields in big-endian order, so big-endian is also called network byte order. When two hosts with different byte orders communicate, each must convert data from host byte order to network byte order before transmitting it. Four functions, often implemented as macros, perform this conversion: htons, htonl, ntohs, and ntohl. They come from the BSD sockets API and POSIX (<arpa/inet.h>), not from ANSI C.
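The host-to-network conversion described above can be sketched in Python, which exposes the same conversion routines through its `socket` module and offers the `!` (network order) format prefix in `struct`. The port number `0x1F90` (8080) and the variable names are assumptions chosen for illustration.

```python
import socket
import struct
import sys

port = 0x1F90  # 8080, a 16-bit value in host byte order

# struct's "!" prefix always produces network (big-endian) byte order.
wire = struct.pack("!H", port)
print(wire.hex())  # 1f90

# htons swaps the bytes on a little-endian host and is a no-op on a big-endian one.
converted = socket.htons(port)
print(hex(converted), "on a", sys.byteorder, "endian host")

# The receiver reverses the conversion to get the original value back.
print(socket.ntohs(converted) == port)         # True
print(struct.unpack("!H", wire)[0] == port)    # True
```

Using `struct.pack("!H", ...)` is often the more convenient choice in Python, because it produces the bytes to send directly instead of a byte-swapped integer.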

To summarize: the character set specifies the range of characters included, the character encoding specifies how those characters are encoded and represented, and the byte order specifies how the encoded data is stored and transmitted.
