Character set & character encoding


This article gives a brief introduction to characters, character sets, and character encodings in computing.

I. Characters

A character is a unit of information, including letters, digits, and symbols (punctuation marks, graphic symbols, control characters). Characters are meant to be displayed to people (control characters can be regarded as controlling the display format), and a character is the smallest unit of data access in a data structure.

II. Font library

The collection of all displayable characters.

III. Character set

1. Background

A computer has to convert characters into binary before it can process them.

2. Definition

A character set is a mapping between a set of characters and the binary sequences (numbers) used by the computer.

3. Examples

1) ASCII

ASCII (American Standard Code for Information Interchange) is one of the earliest character sets.

Each character is stored in 8 bits: the highest bit is defined as 0, and the remaining 7 bits represent 128 characters, including the common English characters and some control symbols.
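As a small illustration of the mapping described above, the sketch below uses Python's built-in `ord` to look up the number assigned to a character and prints it as 8 bits. The helper name `is_ascii` is made up for this example.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character has a code below 128 (fits in 7 bits)."""
    return all(ord(ch) < 128 for ch in text)

for ch in ("A", "a", "0", "\n"):
    code = ord(ch)                       # character -> number
    bits = format(code, "08b")           # 8 bits; the highest bit is always 0
    print(repr(ch), code, bits)

print(is_ascii("Hello"), is_ascii("caf\u00e9"))
```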

2) ANSI

To represent languages other than English, the previously unused highest bit is put to work to encode additional symbols. This is an extension of ASCII.

Characters are stored in 8 bits, giving up to 256 symbols. Codes 0~127 are the same as ASCII, and codes 128~255 are the extension relative to ASCII, known as the "extended character set". ISO has defined a series of such character codes, ISO-8859-1 through ISO-8859-16. Among them, ISO-8859-1 (also known as Latin-1) covers most Western European languages and is the most widely used.
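A minimal sketch of the 0~127 / 128~255 split, using Python's built-in `latin-1` codec (Python's name for ISO-8859-1). The sample strings are arbitrary.

```python
text = "caf\u00e9"                      # "café"; U+00E9 is outside ASCII

latin1_bytes = text.encode("latin-1")   # one byte per character
print(list(latin1_bytes))               # the last byte is above 127

# Codes 0~127 are shared with ASCII, so plain English text is byte-identical.
same_for_ascii = "abc".encode("latin-1") == "abc".encode("ascii")

# ASCII itself has no code for the extended characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(same_for_ascii, ascii_ok)
```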

3) Unicode

Unicode was created to cover all the languages of the world in one character set. The corresponding ISO standard is called the Universal Multiple-Octet Coded Character Set, abbreviated UCS (Universal Character Set). (Several efforts started with the same goal of unifying everything; they later reached a consensus, so that UCS and Unicode define the same character set.)

  • UCS-2

    Uses 2 bytes per character.

  • UCS-4

    Uses 4 bytes per character; the highest bit is always 0.

    UCS-4 uses the first byte (whose highest bit is 0) to divide the code space into 2^7 = 128 groups. Each group is divided into 2^8 = 256 planes by the second byte, each plane into 256 rows by the third byte, and each row into 256 cells by the fourth byte.


Plane 0 of group 0 (the codes whose upper two bytes are 0) is called the Basic Multilingual Plane, or BMP.

Dropping the two leading zero bytes of a BMP code gives its UCS-2 form. Characters have since been allocated outside the BMP as well (for example emoji and rarer CJK ideographs), and the code space is now limited to U+0000 through U+10FFFF, so UCS-2 alone cannot represent every character.
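The group/plane/row/cell layout can be sketched with a few bit shifts. The function names `ucs4_parts` and `in_bmp` below are made up for this example.

```python
def ucs4_parts(code_point: int) -> tuple:
    """Split a 31-bit UCS-4 value into (group, plane, row, cell)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values use at most 31 bits")
    group = (code_point >> 24) & 0x7F    # first byte, highest bit is 0
    plane = (code_point >> 16) & 0xFF    # second byte
    row = (code_point >> 8) & 0xFF       # third byte
    cell = code_point & 0xFF             # fourth byte
    return group, plane, row, cell

def in_bmp(ch: str) -> bool:
    """True if the character lies in plane 0 of group 0."""
    group, plane, _, _ = ucs4_parts(ord(ch))
    return group == 0 and plane == 0

print(ucs4_parts(ord("\u4e2d")))         # a common Chinese character
print(in_bmp("A"), in_bmp("\U0001F600")) # an emoji lies outside the BMP
```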

4) Chinese character set

GB, short for GuoBiao (National Standard), is the family of character sets for Chinese characters.
The location code (the row and cell number of a character) defined in the GB series can be regarded as the character set.

IV. Character encoding

1. Cause

Storing characters directly as the numbers defined by the character set is not always suitable for computer storage and network transmission. The number (binary sequence) defined in the character set therefore has to be converted once more, and this is what gives rise to character encodings.

2. Definition

Character encoding specifies how to encode and store the binary sequence corresponding to these characters.

3. Examples

1) ASCII

Characters are stored and transmitted directly, using the character-to-sequence mapping defined by the ASCII character set.

2) ANSI

Characters are stored and transmitted directly, using the character-to-sequence mapping defined by the ANSI character set.

3) UTF

UTF (Unicode/UCS Transformation Format) is the family of character encodings for the Unicode character set.

Storing Unicode code points directly takes two or four bytes per character, so the leading bytes of many English letters are all 0, which wastes space. The UTF encodings were created to address this.

  • UTF-8

    Uses a variable-length encoding (1~4 bytes): the number of bytes depends on the symbol, so that encodings are kept as short as possible. The encoding rules are:

    • For a single-byte symbol, the first bit of the byte is set to 0, and the following 7 bits hold the Unicode code of the symbol. For English letters, the UTF-8 encoding is therefore identical to the ASCII code.
    • For an n-byte symbol (n>1), the first n bits of the first byte are all set to 1, the (n+1)th bit is set to 0, and the first two bits of each following byte are set to 10. All remaining bits hold the Unicode code of the symbol. The first byte of a single-byte encoding is 00-7F, the first byte of a two-byte encoding is C2-DF, the first byte of a three-byte encoding is E0-EF, and the first byte of a four-byte encoding is F0-F4.

    Therefore, UTF-8 has the following features:

    • The range of the first byte tells you how many bytes the encoding has, which greatly simplifies the encoding and decoding algorithms;
    • A BOM is no longer needed to indicate byte order, but it can be used to mark the encoding. The UTF-8 encoding of the character "Zero Width No-Break Space" (U+FEFF) is EF BB BF, so a receiver that sees a byte stream beginning with EF BB BF knows that it is UTF-8. Windows uses this BOM to mark text files, but some software and languages, and mechanisms such as cookie sending, do not recognize or handle the BOM header, so to avoid errors it is not recommended to use a BOM in UTF-8 files. If you have to use UTF-8, you can save the file as ASCII when it contains only English characters (or other characters in ASCII); when it contains Chinese characters, save it as "UTF-8 without BOM".
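The UTF-8 rules and the BOM behaviour above can be sketched as follows. This is an illustrative hand-written encoder checked against Python's built-in codec, not a replacement for it; the names `utf8_encode` and `utf8_length` are made up for this example, and the four-byte lead range F0-F4 is taken from the UTF-8 definition.

```python
import codecs

def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the UTF-8 bit layout."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:
        return bytes([code_point])       # 0xxxxxxx, same as ASCII
    n = 2 if code_point < 0x800 else 3 if code_point < 0x10000 else 4
    tail = []
    for _ in range(n - 1):               # continuation bytes: 10xxxxxx
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    lead_prefix = (0xFF << (8 - n)) & 0xFF   # n one-bits followed by a zero bit
    return bytes([lead_prefix | code_point] + tail[::-1])

def utf8_length(first_byte: int) -> int:
    """Length of a UTF-8 sequence, judged from its first byte alone."""
    if first_byte <= 0x7F:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", encoded.hex(), utf8_length(encoded[0]))

# The BOM is just U+FEFF encoded as UTF-8; "utf-8-sig" strips it when decoding.
raw = codecs.BOM_UTF8 + "hi".encode("utf-8")
print(repr(raw.decode("utf-8")), repr(raw.decode("utf-8-sig")))
```

Decoding with plain `utf-8` keeps the BOM as a leading U+FEFF character, which is one way a BOM ends up confusing software that does not expect it.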
  • UTF-16

    It is specified in RFC 2781 and uses two or four bytes per character. The two-byte form is essentially the same as the mapping defined by UCS-2; the four-byte form lets UTF-16 represent part of the UCS-4 characters (U+10000~U+10FFFF). UTF-16 comes in three variants:

    • UTF-16

      A BOM (Byte Order Mark) character is required at the beginning of the file to indicate whether the file is big endian or little endian (see byte order (big endian & little endian) for details).

    • UTF-16BE(Big Endian)

    • UTF-16LE(Little Endian)

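    Example: the three characters "ABC" encoded in each variant give the following bytes:

    UTF-16BE                      00 41 00 42 00 43
    UTF-16LE                      41 00 42 00 43 00
    UTF-16 (big-endian BOM)       FE FF 00 41 00 42 00 43
    UTF-16 (little-endian BOM)    FF FE 41 00 42 00 43 00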
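The following sketch encodes "ABC" with Python's built-in UTF-16 codecs and shows how a character above U+FFFF is split into two 16-bit units (a surrogate pair). The helper names `hex_bytes` and `to_surrogate_pair` are made up for this example; note that Python's plain `utf-16` codec writes a BOM and then uses the byte order of the machine it runs on.

```python
import codecs

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

text = "ABC"
utf16_be = text.encode("utf-16-be")      # no BOM, big endian
utf16_le = text.encode("utf-16-le")      # no BOM, little endian
utf16_bom = text.encode("utf-16")        # BOM first, then native byte order

for label, data in (("UTF-16BE", utf16_be), ("UTF-16LE", utf16_le), ("UTF-16", utf16_bom)):
    print(f"{label:9}", hex_bytes(data))

def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000~U+10FFFF into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}", hex_bytes("\U0001F600".encode("utf-16-be")))
```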

  • UTF-32

    Uses four bytes per character, so it can represent every UCS-4 character directly, without the more complex algorithm that UTF-16 needs for some UCS-4 characters. UTF-32 comes in three variants:

    • UTF-32

      A BOM (Byte Order Mark) character is required at the beginning of the file to indicate whether the file is big endian or little endian (see byte order (big endian & little endian) for details).

    • UTF-32BE(Big Endian)

    • UTF-32LE(Little Endian)

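    Example: the three characters "ABC" encoded in each variant give the following bytes:

    UTF-32BE                      00 00 00 41 00 00 00 42 00 00 00 43
    UTF-32LE                      41 00 00 00 42 00 00 00 43 00 00 00
    UTF-32 (big-endian BOM)       00 00 FE FF 00 00 00 41 00 00 00 42 00 00 00 43
    UTF-32 (little-endian BOM)    FF FE 00 00 41 00 00 00 42 00 00 00 43 00 00 00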
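A short sketch of UTF-32 with Python's built-in codecs. Because every character takes exactly four bytes, the character at position i can be read directly at byte offset 4*i; the helper name `char_at` is made up for this example.

```python
import codecs

text = "ABC"
utf32_be = text.encode("utf-32-be")      # no BOM, big endian
utf32_le = text.encode("utf-32-le")      # no BOM, little endian
utf32_bom = text.encode("utf-32")        # BOM first, then native byte order
print(utf32_be.hex(), utf32_le.hex(), utf32_bom.hex(), sep="\n")

def char_at(data_be: bytes, index: int) -> str:
    """Read character number `index` from big-endian UTF-32 without scanning."""
    unit = data_be[4 * index: 4 * index + 4]
    return chr(int.from_bytes(unit, "big"))

mixed = "A\U0001F600C".encode("utf-32-be")   # the emoji also takes four bytes
print(len(mixed), repr(char_at(mixed, 1)))
```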

4) GB series

GB, short for GuoBiao (National Standard), is the family of encodings for Chinese characters.

  • GB2312

    Uses 16 bits (2 bytes) to represent common Chinese characters and some symbols.

    It is also compatible with single-byte ASCII, so it can be regarded as a variable-length encoding that mixes single-byte and double-byte codes.

  • GBK1.0

    Compatible with GB2312 and contains more characters and symbols. It is currently the most widely used of the GB encodings.

    It is also compatible with single-byte ASCII, so it can be regarded as a variable-length encoding that mixes single-byte and double-byte codes.

  • GB18030

    Compatible with GB2312 and GBK, with more characters and symbols incorporated. It is the official national standard.

    A variable-length encoding that uses single-byte, double-byte, and four-byte codes. The single-byte and double-byte codes are fully compatible with GBK, and the four-byte codes are the extension.
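The compatibility chain between the three GB encodings can be observed with Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. The helper name `try_encode` and the sample characters are choices made for this sketch: an ASCII letter, a common simplified character, a traditional character, and an emoji.

```python
def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the encoding has no code for the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

samples = {
    "ASCII letter": "A",
    "common simplified character": "\u4e2d",
    "traditional character": "\u9ad4",
    "emoji": "\U0001F600",
}

for label, ch in samples.items():
    for encoding in ("gb2312", "gbk", "gb18030"):
        data = try_encode(ch, encoding)
        shown = "not encodable" if data is None else f"{len(data)} byte(s): {data.hex()}"
        print(f"{label:28} {encoding:8} {shown}")
```

The printout shows one byte for the ASCII letter, two bytes for Chinese characters, and four bytes only in GB18030.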

4. Use

  • Operating system

    UTF-8: the default encoding on most Linux systems and on macOS

    GBK: the default encoding on the Chinese version of Windows

  • Shell

    On a standalone system, the terminal encoding is generally the same as the operating system's, but you may run into problems when logging in remotely.

  • Text file

    The encoding can usually be chosen when the file is saved.

  • Program

    This depends on the specific programming language and concerns how variables are represented in memory while the program is running.

    For example, in Java and Python 3, strings hold Unicode characters (java.lang.String stores all characters in UTF-16), so Chinese is well supported. In Python 2, however, Unicode is not the default string type (Python has supported Unicode strings since 2.0), so encoding conversion is required. The function decode(char_set) converts from another encoding to Unicode, and the function encode(char_set) converts from Unicode to another encoding. The Unicode string mentioned here refers to code points stored as UCS-2 or UCS-4. Note that the concepts of encoding and decoding apply only to character-to-byte or byte-to-character conversion.
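A minimal Python 3 sketch of the character-to-byte and byte-to-character conversions described above. The helper name `transcode` and the sample text are made up for this example.

```python
text = "\u7f16\u7801 encoding"           # a str holds Unicode characters

utf8_bytes = text.encode("utf-8")        # characters -> bytes
gbk_bytes = text.encode("gbk")
print(len(text), len(utf8_bytes), len(gbk_bytes))

def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings by going through str."""
    return data.decode(source).encode(target)

# Decoding with the wrong encoding either fails or produces garbage.
try:
    gbk_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
print(wrong_decode_failed)
```

There is no direct bytes-to-bytes conversion here: `transcode` always decodes to characters first and then encodes again.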

Common questions

Character set vs. character encoding

A character encoding can be seen as an implementation of a character set, or as a second round of encoding on top of it. A character set can therefore correspond to multiple character encodings.

If, for a given character set, characters are stored and encoded exactly as the character-to-binary-sequence mapping defined by the character set, then the character set and the character encoding are the same thing. Examples: ASCII and ANSI.
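As a small illustration, the sketch below takes one character, whose number in the Unicode character set is fixed, and encodes it with three different Unicode encodings. For ASCII, by contrast, the stored byte is simply the character's number.

```python
ch = "\u4e2d"                            # one character, one code point
code_point = ord(ch)
print(f"U+{code_point:04X}")

encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")     # same character, different byte sequences

# Each encoding decodes back to the same character.
round_trips = all(data.decode(name) == ch for name, data in encoded.items())

# For ASCII, the character set and the encoding coincide.
ascii_same = "A".encode("ascii") == bytes([ord("A")])
print(round_trips, ascii_same)
```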

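utf8 and utf8mb4 in MySQL

  • MySQL's "utf8mb4" is the real "UTF-8".
  • MySQL's "utf8" is a proprietary encoding (an alias of utf8mb3) that uses at most three bytes per character, so it can encode only part of Unicode: characters outside the BMP, such as most emoji, cannot be stored.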
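A rough way to check whether a string would fit a three-byte-per-character encoding is to look at the UTF-8 length of each character. This is only a sketch on the Python side, assuming that MySQL's "utf8" stores at most three bytes per character; the function name `fits_three_byte_utf8` is made up and it does not talk to a database.

```python
def fits_three_byte_utf8(text: str) -> bool:
    """True if every character needs at most three bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)

print(fits_three_byte_utf8("\u4e2d\u6587 abc"))   # BMP characters only
print(fits_three_byte_utf8("ok \U0001F600"))      # the emoji needs four bytes
```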

Why use Unicode storage and UTF-8 transmission?

  • Why use Unicode storage?
    First of all, the statement "use Unicode storage" is not entirely correct; strictly speaking it should be "use fixed-length UTF-16 storage". The reason is as follows.
    At first, "Unicode" was the name Windows gave to the fixed-length 16-bit little-endian encoding it used, and computers did store text in this "Unicode" (fixed-length 16-bit LE). Later, as more characters such as additional Chinese characters were introduced, Unicode was extended beyond 16 bits in order to include all characters. UTF-16 was then defined for the 16-bit encoding, and the characters beyond 16 bits were added to it as 32-bit sequences, which made it a variable-length encoding. In-memory storage nevertheless still commonly uses the earlier fixed-length 16-bit form, and the habit of calling this "Unicode" storage has stuck. Fixed-length 16-bit units let the computer read data quickly, and they cover most commonly used characters.
  • Why use UTF-8 transmission?
    UTF-8 saves space when transmitting ASCII characters, so it is often used for data transmission. For in-memory storage, however, the variable-length rule means the computer has to scan from the beginning to find the position of a given character, which costs time.
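Both points can be sketched in a few lines: UTF-8 is shorter for ASCII text, but finding the n-th character means scanning from the start. The function name `utf8_char_offset` is made up for this example, and it assumes the input is valid UTF-8.

```python
def utf8_char_offset(data: bytes, index: int) -> int:
    """Byte offset of character number `index`, found by scanning from the start."""
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a 10xxxxxx byte: a character starts here
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")

data = "a\u4e2db".encode("utf-8")        # 1-byte, 3-byte, 1-byte characters
print([utf8_char_offset(data, i) for i in range(3)])

# Space comparison for plain English text.
english = "hello"
print(len(english.encode("utf-8")), len(english.encode("utf-16-le")))
```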

The difference between base64 and UTF?


