[MySQL Study Notes (3)] Character Set and Comparison Rules in MySQL

This article was published by the official account [Developing Pigeon].



One. Character set and comparison rules

(1) Overview

       Computers store only binary data. To store a string, a mapping must be established between characters and binary data. First the range of characters is defined; mapping those characters to binary data is then called encoding, and mapping binary data back to characters is called decoding. A character set describes the encoding rules for a given range of characters, that is, which character maps to which binary data.
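
       As a minimal sketch of encoding and decoding, the following uses Python's built-in codecs rather than MySQL itself; the sample string is arbitrary.

```python
# Encoding maps characters to bytes; decoding maps bytes back to characters.
text = "aB"

encoded = text.encode("ascii")      # characters -> binary data
decoded = encoded.decode("ascii")   # binary data -> characters

byte_values = list(encoded)         # the stored bytes as integers
print(byte_values, decoded)
```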

       Once a character set is chosen, how are two characters compared? The most obvious approach is to compare the binary data the two characters are encoded to; this is called a binary comparison rule. In practice many situations are more complicated. For example, to compare English letters case-insensitively, the characters are first converted to a single case (all uppercase or all lowercase) and the resulting binary data is then compared.
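
       The two approaches can be sketched in Python as follows. The function names are hypothetical, and real MySQL comparison rules are more elaborate than a plain case fold.

```python
def binary_compare(a: str, b: str, charset: str = "utf-8") -> int:
    """Compare the encoded bytes directly; returns -1, 0, or 1."""
    x, y = a.encode(charset), b.encode(charset)
    return (x > y) - (x < y)


def case_insensitive_compare(a: str, b: str, charset: str = "utf-8") -> int:
    """Fold both strings to upper case first, then compare the bytes."""
    return binary_compare(a.upper(), b.upper(), charset)


print(binary_compare("a", "A"))            # 'a' is byte 97, 'A' is byte 65
print(case_insensitive_compare("a", "A"))  # equal after case folding
```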

(2) Commonly used character sets

1. ASCII character set

       128 characters in total, including the space, punctuation marks, digits, and uppercase and lowercase letters. Each character fits in one byte (8 bits); only 7 of those bits are actually needed.

2. ISO 8859-1 character set

       256 characters in total. It extends ASCII with 128 characters commonly used in Western Europe and still encodes every character in one byte.

3. GB2312 character set

       Contains Chinese characters (among others) and is compatible with ASCII: a character that is in ASCII is encoded in one byte, and any other character is encoded in two bytes. An encoding in which different characters may take a different number of bytes is called a variable-length encoding.
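
       A quick way to see the variable length is to encode one ASCII character and one Chinese character. This sketch uses Python's gb2312 codec; byte_length is a hypothetical helper.

```python
def byte_length(ch: str, charset: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(charset))


for ch in ("a", "我"):
    # ASCII characters take one byte in GB2312, Chinese characters take two.
    print(ch, byte_length(ch, "gb2312"))
```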

4. GBK character set

       An extension of GB2312 that covers more characters and remains compatible with GB2312.

5. UTF-8 character set

       Covers almost every character used in every region, is compatible with ASCII, and uses a variable-length encoding of 1 to 4 bytes per character.

       When reading a byte, how can we tell whether it is a whole character or only part of one? If the byte is in the range 0-127, that is, its highest bit is 0, it represents a single (ASCII) character. If the highest bit is 1, the value is above 127 and the byte is part of a multi-byte character. In GB2312 this means that two bytes together represent one character; in UTF-8 the leading bits of the first byte (110, 1110 or 11110) give the total length of 2, 3 or 4 bytes, and every following byte starts with 10.
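
       For UTF-8 specifically, the length of a character can be read from the high bits of its first byte. The sketch below illustrates that rule; utf8_sequence_length is a hypothetical helper and does no validation beyond the first byte.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if first_byte < 0x80:             # 0xxxxxxx: single-byte (ASCII) character
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: first byte of a 2-byte character
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: first byte of a 3-byte character
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: first byte of a 4-byte character
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError("not a valid first byte")


for ch in ("a", "é", "我", "😀"):
    data = ch.encode("utf-8")
    print(ch, utf8_sequence_length(data[0]), len(data))
```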


Two. Character sets and comparison rules in MySQL

(1) utf8

       In MySQL, the maximum number of bytes a character set uses per character affects both storage and performance. utf8mb3 uses only 1-3 bytes per character, and utf8 is its alias; utf8mb4 uses 1-4 bytes per character and is the complete UTF-8 character set.
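
       The practical difference is that a character needing 4 bytes in UTF-8 (many emoji, for example) does not fit in utf8mb3. The following sketch checks this outside MySQL; both function names are hypothetical.

```python
def max_utf8_bytes(text: str) -> int:
    """Largest number of UTF-8 bytes used by any single character in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 bytes, the utf8mb3 limit."""
    return max_utf8_bytes(text) <= 3


print(fits_in_utf8mb3("MySQL 我"))  # letters and common Chinese characters fit
print(fits_in_utf8mb3("😀"))        # this emoji needs 4 bytes
```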

(2) View the character set

SHOW CHARSET LIKE 'pattern';

       CHARSET and CHARACTER SET are synonyms. The Maxlen column of the output shows the maximum number of bytes one character can occupy: 1 for ascii, 1 for latin1, 2 for gb2312, and 2 for gbk.

(3) View the comparison rules

SHOW COLLATION LIKE 'pattern';

       A character set may have several comparison rules. The name of a comparison rule starts with the name of its character set; for example, every comparison rule of utf8 starts with utf8. Next comes the language the rule is mainly intended for: utf8_polish_ci compares according to Polish rules. A special case is utf8_general_ci, which is a general-purpose comparison rule.

       The trailing suffix states whether the rule is sensitive to accents and case, or whether it is binary: _ai means accent insensitive, _as means accent sensitive, _ci means case insensitive, _cs means case sensitive, and _bin means binary comparison.
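
       The naming convention can be illustrated by splitting a comparison rule name into its parts. This is only a string-level sketch: describe_collation is a hypothetical helper, not a MySQL function, and it assumes the name follows the pattern described above.

```python
SUFFIX_MEANINGS = {
    "ai": "accent insensitive",
    "as": "accent sensitive",
    "ci": "case insensitive",
    "cs": "case sensitive",
    "bin": "binary",
}


def describe_collation(name: str) -> dict:
    """Split a name such as utf8_polish_ci into charset, language and suffixes."""
    parts = name.split("_")
    charset, rest = parts[0], parts[1:]
    flags = []
    # Peel the known suffixes off the end; whatever is left names the language.
    while rest and rest[-1] in SUFFIX_MEANINGS:
        flags.insert(0, rest.pop())
    return {
        "charset": charset,
        "language": "_".join(rest),
        "properties": [SUFFIX_MEANINGS[f] for f in flags],
    }


print(describe_collation("utf8_polish_ci"))
print(describe_collation("utf8_bin"))
```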

Three. Applying character sets and comparison rules

       A character set and its comparison rule are linked: if only the character set is changed, the comparison rule changes to that character set's default comparison rule; if only the comparison rule is changed, the character set changes to the one the comparison rule belongs to.

(1) Four levels of character sets and comparison rules

1. Server level

       Two system variables represent server-level character sets and comparison rules:

character_set_server : server-level character set;
collation_server : server-level comparison rule;

       When starting the server program, you can modify the values of these two variables through startup options or configuration files.

2. Database level

       When creating or modifying a database, you can specify its character set and comparison rule. Two system variables show the character set and comparison rule of the current database:

character_set_database;
collation_database;

       Set the character set and comparison rules of the database when creating and modifying the database:

CREATE DATABASE test CHARACTER SET utf8 COLLATE utf8_polish_ci;
ALTER DATABASE test CHARACTER SET utf8 COLLATE utf8_polish_ci;

       If the character set and comparison rules are not specified when creating the database, the server-level character set and comparison rules will be used.

3. Table level

       Likewise, the character set and comparison rule of a table can be specified when creating or modifying it; if they are not specified, the table inherits them from the database.

4. Column level

       For columns storing strings, different columns in the same table can also have different character sets and comparison rules, which we can specify when creating or modifying columns.

CREATE TABLE table_name (
    column_name string_type CHARACTER SET charset_name COLLATE collation_name
);
ALTER TABLE table_name MODIFY column_name string_type CHARACTER SET charset_name COLLATE collation_name;

       Likewise, if the character set and comparison rule of a column are not specified, the column inherits them from the table. Choose the character set and comparison rule of each column according to the data it will store.
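
       The fallback across the four levels can be pictured as "use the most specific setting that was given". The sketch below models only that lookup; effective_charset is a hypothetical helper, and None stands for "not specified at this level".

```python
from typing import Optional


def effective_charset(
    server: str,
    database: Optional[str] = None,
    table: Optional[str] = None,
    column: Optional[str] = None,
) -> str:
    """Return the character set a column ends up with after inheritance."""
    # Check from the most specific level to the least specific one.
    for level in (column, table, database):
        if level is not None:
            return level
    return server  # the server level always has a value


print(effective_charset("utf8mb4"))
print(effective_charset("utf8mb4", database="gbk"))
print(effective_charset("utf8mb4", database="gbk", column="latin1"))
```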

(2) The character set in the communication between the client and the server

       From the machine's point of view, both the request sent by the client and the response returned by the server are just sequences of bytes. Along the request-response path the data goes through several character set conversions.

1. The client sends a request

       In general, the character set the client uses to encode the request string is the character set currently used by the operating system. This is the rule on Unix-like operating systems. On Windows, however, if the default-character-set startup option is specified when the MySQL client program is started, the request string is encoded in that character set instead.

       On Unix-like systems, the following command shows the character set of the current system (if LC_ALL is not set, LC_CTYPE and then LANG are consulted):

echo $LC_ALL

2. The server receives the request

       The server receives a sequence of bytes and treats it as having been encoded with the character set given by the system variable character_set_client, which is a SESSION-level variable. Note that the character set the client actually used to encode the request and the character set the server assumes when it receives the bytes are two independent settings.


3. The server processes the request

       When the server actually processes the request, it converts the byte sequence from the character set given by character_set_client to the character set given by the SESSION-level variable character_set_connection; collation_connection gives the corresponding comparison rule.

4. The server generates a response

       After processing the request, the server does not send the result to the client as is; it first encodes the result with the character set given by the SESSION-level system variable character_set_results and then sends it.


5. The client receives the response

       On Unix-like systems the client interprets the bytes of the response with the operating system's character set; on Windows it uses the client's default character set.
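
       Steps 1 to 5 can be simulated outside MySQL to show why mismatched settings garble text. This is a rough model, not the server's actual implementation: round_trip is a hypothetical function, and the encoding names are Python codec names (utf-8, latin-1), not MySQL character set names.

```python
def round_trip(
    text: str,
    client_encoding: str,
    character_set_client: str,
    character_set_connection: str,
    character_set_results: str,
    client_decoding: str,
) -> str:
    """Follow a string through the request-response conversions."""
    # 1. The client encodes the request with its own character set.
    request = text.encode(client_encoding)
    # 2. The server interprets the bytes as character_set_client.
    understood = request.decode(character_set_client, errors="replace")
    # 3. The server converts to character_set_connection for processing.
    processed = understood.encode(character_set_connection, errors="replace")
    value = processed.decode(character_set_connection)
    # 4. The server encodes the result with character_set_results.
    response = value.encode(character_set_results, errors="replace")
    # 5. The client interprets the response with its own character set.
    return response.decode(client_decoding, errors="replace")


ok = round_trip("我", "utf-8", "utf-8", "utf-8", "utf-8", "utf-8")
garbled = round_trip("我", "utf-8", "latin-1", "utf-8", "utf-8", "utf-8")
print(ok, repr(garbled))
```

       In the second call the client sends UTF-8 bytes but the server is told to read them as latin-1, so each of the three bytes is treated as a separate character and the original text is lost.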


6. Client default character set

       Each MySQL client maintains a client default character set. When the client starts, it detects the character set currently used by the operating system and maps it to a character set supported by MySQL according to certain rules. If MySQL does not support the system's character set, the client default character set is set to MySQL's default character set. If the default-character-set startup option is specified, the value of that option is used as the client default character set instead. MySQL's default character set is latin1 in 5.7 and earlier, and utf8mb4 from 8.0 on.

       When connecting to the server, the server initializes character_set_client, character_set_connection, and character_set_results to the default character set of the client.
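
       The selection logic and the initialization of the three session variables can be sketched as follows. Both functions are hypothetical, and the mapping "according to certain rules" is simplified here to a membership test against an assumed set of supported names.

```python
from typing import Optional, Set


def client_default_charset(
    os_charset: str,
    supported: Set[str],
    mysql_default: str,
    option: Optional[str] = None,
) -> str:
    """Pick the client default character set as described above."""
    if option is not None:       # default-character-set was given at startup
        return option
    if os_charset in supported:  # simplified stand-in for the mapping rules
        return os_charset
    return mysql_default         # fall back to MySQL's default


def initial_session_variables(default_charset: str) -> dict:
    """All three session variables start out as the client default."""
    names = (
        "character_set_client",
        "character_set_connection",
        "character_set_results",
    )
    return {name: default_charset for name in names}


supported = {"utf8mb4", "gbk", "latin1"}  # assumed sample, not the full list
chosen = client_default_charset("gbk", supported, "utf8mb4")
print(initial_session_variables(chosen))
```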


Origin blog.csdn.net/Mrwxxxx/article/details/113795388