MySQL character sets and collation rules explained, and how to view them

Detailed explanation of MySQL character set and collation rules

Before explaining character sets and collation rules, let's take a brief look at characters, character sets, and character encodings.

A character is the collective name for the letters, digits, and symbols used in computers. A character can be a Chinese character, an English letter, an Arabic numeral, a punctuation mark, and so on.

Computers store data in binary form. The digits, English letters, punctuation marks, Chinese characters, and other characters we see on screen are all converted from binary values.

A character set defines the correspondence between characters and binary values by assigning each character a unique number. Common character sets include ASCII, GBK, and ISO-8859-1.

A character encoding, also called a character set code, specifies how those character numbers are stored in the computer.

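Most character sets correspond to only one character encoding. For example, ASCII, ISO-8859-1, GB2312, and GBK each name both a character set and its encoding, so in general the two terms can be treated as synonyms. The Unicode character set is the exception: it has three encoding schemes, UTF-8, UTF-16, and UTF-32, of which UTF-8 is the most commonly used.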
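To make the difference between a character's number and its stored bytes concrete, here is a minimal sketch in Python. It is only an illustration outside MySQL: Python's built-in codecs stand in for the encodings named above, and the helper name encoded_forms is made up for this example.

```python
def encoded_forms(char, encodings=("utf-8", "utf-16-be", "utf-32-be")):
    """Return {encoding name: hex string of the bytes stored for char}."""
    return {name: char.encode(name).hex() for name in encodings}


char = "中"

# The Unicode character set assigns the character one number (its code point).
print(f"code point: U+{ord(char):04X}")

# Each Unicode encoding scheme stores that same number as different bytes.
for name, hex_bytes in encoded_forms(char).items():
    print(f"{name:10} {hex_bytes}")

# GBK is a different character set, so it has its own number and bytes.
print(f"{'gbk':10} {char.encode('gbk').hex()}")
```

The same character ends up as two, three, or four bytes depending on the encoding, which is why the encoding has to be agreed on before bytes can be read back as text.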

Collation rules, also called sorting rules, are the rules for comparing characters within the same character set. Character sets and collation rules have a one-to-many relationship: one character set can have several collation rules, and each character set has a default one. Character sets and collation rules complement and depend on each other.

Simply put, character sets are used to define how MySQL stores strings, and collation rules are used to define how MySQL compares strings.

Some databases do not clearly distinguish between character sets and collation rules. For example, when creating a database in SQL Server, selecting a character set is equivalent to selecting a character set and collation rules.

In MySQL, the character set and the collation rules are separate settings, and both must have a value. In practice, unless you have special requirements, setting just one of them is enough. When only the character set is set, MySQL uses that character set's default collation rules.

You can check the character sets MySQL is currently using with the SHOW VARIABLES LIKE 'character%'; command. The command and its output are as follows:

mysql> SHOW VARIABLES LIKE 'character%';
+--------------------------+---------------------------------------------------------+
| Variable_name            | Value                                                   |
+--------------------------+---------------------------------------------------------+
| character_set_client     | gbk                                                     |
| character_set_connection | gbk                                                     |
| character_set_database   | latin1                                                  |
| character_set_filesystem | binary                                                  |
| character_set_results    | gbk                                                     |
| character_set_server     | latin1                                                  |
| character_set_system     | utf8                                                    |
| character_sets_dir       | C:\Program Files\MySQL\MySQL Server 5.7\share\charsets\ |
+--------------------------+---------------------------------------------------------+
8 rows in set, 1 warning (0.01 sec)

The results above are explained in the following table:

Name                      Description
character_set_client      Character set used by the MySQL client
character_set_connection  Character set used when connecting to the database
character_set_database    Character set used by the default (current) database
character_set_filesystem  Character set used for the server's file system; the default, binary, means no conversion
character_set_results     Character set used when returning data to the client
character_set_server      Default character set of the MySQL server; best left to the system rather than set manually
character_set_system      Character set used by the database system; the default is utf8 and it need not be set
character_sets_dir        Directory where the character sets are installed

When troubleshooting garbled characters, you can ignore the three system variables character_set_filesystem, character_set_system, and character_sets_dir; they do not cause garbled text.

You can view the collation rules MySQL is currently using with the SHOW VARIABLES LIKE 'collation\_%'; command. The command and its output are as follows:

mysql> SHOW VARIABLES LIKE 'collation\_%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | gbk_chinese_ci    |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set, 1 warning (0.01 sec)

The results above are explained as follows:

  • collation_connection: Collation rules used when connecting to the database
  • collation_database: Collation rules used by the default (current) database
  • collation_server: Default collation rules of the MySQL server


Collation rules are named as follows:

  • The name starts with the name of the character set the collation rule belongs to.
  • The middle part is a country name (or general).
  • The name ends with ci, cs, or bin: ci means case-insensitive, cs means case-sensitive, and bin means comparison by binary code value.
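The naming convention above can be sketched as a small Python helper that splits a collation name into its parts. This is a hypothetical illustration, not a MySQL API: describe_collation and SUFFIX_MEANING are names invented here, and the sketch only covers names that end in ci, cs, or bin as described above.

```python
# Meaning of the suffixes listed above.
SUFFIX_MEANING = {
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "bin": "binary comparison",
}


def describe_collation(name):
    """Split a collation name into (character set, middle part, suffix meaning)."""
    parts = name.split("_")
    if len(parts) < 2 or parts[-1] not in SUFFIX_MEANING:
        raise ValueError(f"name does not follow the convention: {name!r}")
    charset = parts[0]
    # The middle part is absent in names such as gbk_bin.
    middle = "_".join(parts[1:-1]) or None
    return charset, middle, SUFFIX_MEANING[parts[-1]]


for collation in ("gbk_chinese_ci", "latin1_swedish_ci", "gbk_bin"):
    print(collation, "->", describe_collation(collation))
```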

MySQL character set conversion process

The conversion process of character sets in MySQL is as follows:

1) When MySQL commands or SQL statements are executed in the command prompt window (cmd command line), they are converted from the command prompt window's character set to the character set defined by character_set_client.

2) After the command prompt window connects to the MySQL server successfully, a "data communication link" is established. The MySQL commands or SQL statements travel along this link to the MySQL server and are converted from the character set defined by character_set_client to the character set defined by character_set_connection.

3) After the MySQL service instance receives the commands or statements from the data communication link, it converts them from the character set defined by character_set_connection to the character set defined by character_set_server.

4) If a command or statement operates on a particular database, it is converted from the character set defined by character_set_server to the character set defined by character_set_database.

5) After the command or statement is executed, the result is converted to the character set defined by character_set_results.

6) The result is returned along the same data communication link, converted from the character set defined by character_set_results to the character set defined by character_set_client, and finally converted to the command prompt window's character set and displayed in the command prompt window.
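Each step above is a re-encoding from one character set to another, which is also where garbled characters come from. The following Python sketch is an assumption-laden illustration rather than MySQL's actual implementation: the convert helper is hypothetical, the character set choices are examples, and Python's latin1 codec only approximates MySQL's latin1.

```python
def convert(data, source, target):
    """Re-encode bytes from one character set to another, as each step does."""
    return data.decode(source).encode(target)


text = "中文"
window_bytes = text.encode("gbk")  # bytes typed in a gbk command prompt window

# Matching settings: the declared character set is the one the bytes really use,
# so the text survives the trip to the server and back.
stored = convert(window_bytes, "gbk", "utf-8")
shown = convert(stored, "utf-8", "gbk").decode("gbk")
print(shown)

# Mismatched settings: the same gbk bytes are declared to be latin1,
# so every byte is misread as a separate Western European character.
garbled = convert(window_bytes, "latin1", "utf-8").decode("utf-8")
print(garbled)
```

In this sketch the bytes are never damaged in the mismatched case. They are simply interpreted with the wrong character set, which is why fixing the declared character sets is usually the first thing to check.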

 

Viewing MySQL character sets and collation rules

In the "Detailed explanation of MySQL character set and collation rules" section, we learned about MySQL character sets and collation rules. This section introduces several ways to view them.

In MySQL, the command to view the available character sets and its output are as follows:

mysql> SHOW CHARACTER SET;
+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
| dec8     | DEC West European               | dec8_swedish_ci     |      1 |
| cp850    | DOS West European               | cp850_general_ci    |      1 |
| hp8      | HP West European                | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian           | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European     | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                    | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                        | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese                 | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese              | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew               | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                     | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean                   | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian                | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese       | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek                | greek_general_ci    |      1 |
| cp1250   | Windows Central European        | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese          | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish              | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian              | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode                   | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                     | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak      | keybcs2_general_ci  |      1 |
| macce    | Mac Central European            | macce_general_ci    |      1 |
| macroman | Mac West European               | macroman_general_ci |      1 |
| cp852    | DOS Central European            | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic              | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode                   | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic                | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode                  | utf16_general_ci    |      4 |
| utf16le  | UTF-16LE Unicode                | utf16le_general_ci  |      4 |
| cp1256   | Windows Arabic                  | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic                  | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode                  | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset           | binary              |      1 |
| geostd8  | GEOSTD8 Georgian                | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese       | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese       | eucjpms_japanese_ci |      3 |
| gb18030  | China National Standard GB18030 | gb18030_chinese_ci  |      4 |
+----------+---------------------------------+---------------------+--------+
41 rows in set (0.02 sec)

Where:

  • The first column (Charset) is the character set name;
  • The second column (Description) is the character set description;
  • The third column (Default collation) is the default collation rule of the character set;
  • The fourth column (Maxlen) indicates the maximum number of bytes occupied by a character in the character set.
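The Maxlen column can be checked informally by counting the bytes each encoding needs for a few sample characters. The sketch below uses Python's codecs as stand-ins for the MySQL character sets of the same name; byte_lengths and the sample characters are assumptions chosen for illustration.

```python
# One sample each: ASCII letter, accented letter, Chinese character, emoji.
SAMPLES = ("A", "é", "中", "\U0001F600")


def byte_lengths(encoding, chars=SAMPLES):
    """Return {character: bytes needed in encoding}, or None if not representable."""
    lengths = {}
    for ch in chars:
        try:
            lengths[ch] = len(ch.encode(encoding))
        except UnicodeEncodeError:
            lengths[ch] = None  # the character is not in this character set
    return lengths


for encoding in ("latin1", "gbk", "utf-8"):
    print(encoding, byte_lengths(encoding))
```

In UTF-8 the emoji needs 4 bytes, which matches the Maxlen of 4 listed for utf8mb4. MySQL's utf8, with a Maxlen of 3, cannot store such 4-byte characters.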


Commonly used character sets are as follows:

  • latin1 supports Western European characters.
  • gbk supports simplified Chinese characters.
  • big5 supports traditional Chinese characters.
  • utf8 supports characters from almost all countries.

You can also check the character sets supported by MySQL by querying the information_schema.character_sets table. The SQL statement and its output are as follows:

mysql> SELECT * FROM information_schema.character_sets;
+--------------------+----------------------+---------------------------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION                     | MAXLEN |
+--------------------+----------------------+---------------------------------+--------+
| big5               | big5_chinese_ci      | Big5 Traditional Chinese        |      2 |
| dec8               | dec8_swedish_ci      | DEC West European               |      1 |
| cp850              | cp850_general_ci     | DOS West European               |      1 |
| hp8                | hp8_english_ci       | HP West European                |      1 |
......

You can use the SHOW COLLATION LIKE '***'; command to view the collation rules of a given character set.

mysql> SHOW COLLATION LIKE 'gbk%';
+----------------+---------+----+---------+----------+---------+
| Collation      | Charset | Id | Default | Compiled | Sortlen |
+----------------+---------+----+---------+----------+---------+
| gbk_chinese_ci | gbk     | 28 | Yes     | Yes      |       1 |
| gbk_bin        | gbk     | 87 |         | Yes      |       1 |
+----------------+---------+----+---------+----------+---------+
2 rows in set (0.00 sec)

The output above lists the collation rules of the GBK character set. gbk_chinese_ci is the default collation rule and is case-insensitive. gbk_bin compares by binary code value and is case-sensitive.

You can also view the collation rules available in MySQL by querying the information_schema.COLLATIONS table. The SQL statement and its output are as follows:

mysql> SELECT * FROM information_schema.COLLATIONS;
+--------------------------+--------------------+-----+------------+-------------+---------+
| COLLATION_NAME           | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN |
+--------------------------+--------------------+-----+------------+-------------+---------+
| big5_chinese_ci          | big5               |   1 | Yes        | Yes         |       1 |
| big5_bin                 | big5               |  84 |            | Yes         |       1 |
| dec8_swedish_ci          | dec8               |   3 | Yes        | Yes         |       1 |
| dec8_bin                 | dec8               |  69 |            | Yes         |       1 |
| cp850_general_ci         | cp850              |   4 | Yes        | Yes         |       1 |
| cp850_bin                | cp850              |  80 |            | Yes         |       1 |
......

Example 1

Compare "A" and "a" under the gbk_chinese_ci and gbk_bin collation rules respectively. The SQL statements and their output are as follows:

mysql> SELECT CASE WHEN 'A' COLLATE gbk_chinese_ci = 'a' COLLATE gbk_chinese_ci then 1
    -> else 0 end;
+-------------------------------------------------------------------------------------+
| CASE WHEN 'A' COLLATE gbk_chinese_ci = 'a' COLLATE gbk_chinese_ci then 1
else 0 end |
+-------------------------------------------------------------------------------------+
|                                                                                   1 |
+-------------------------------------------------------------------------------------+
1 row in set (0.02 sec)

mysql> SELECT CASE WHEN 'A' COLLATE gbk_bin = 'a' COLLATE gbk_bin then 1
    -> else 0 end;
+-----------------------------------------------------------------------+
| CASE WHEN 'A' COLLATE gbk_bin = 'a' COLLATE gbk_bin then 1
else 0 end |
+-----------------------------------------------------------------------+
|                                                                     0 |
+-----------------------------------------------------------------------+
1 row in set (0.00 sec)

Because the gbk_chinese_ci collation rule ignores case, "A" and "a" are considered the same. The gbk_bin collation rule does not ignore case, so the two characters are considered different.

In real applications, confirm in advance how the application needs to sort and whether comparisons are case-sensitive, and then choose the corresponding collation rules.
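The effect of the two kinds of collation rule on comparison and sorting can be approximated in Python. This is only a rough stand-in: equal_ci and equal_bin are hypothetical helpers, and MySQL's real gbk_chinese_ci rule does more than fold case.

```python
def equal_ci(a, b):
    """Rough stand-in for a _ci collation rule: compare after folding case."""
    return a.casefold() == b.casefold()


def equal_bin(a, b, charset="gbk"):
    """Rough stand-in for a _bin collation rule: compare the encoded bytes."""
    return a.encode(charset) == b.encode(charset)


print(equal_ci("A", "a"))   # case is ignored
print(equal_bin("A", "a"))  # byte values 0x41 and 0x61 differ

# The choice of rule changes sort order as well as equality.
words = ["b", "A", "a", "B"]
ci_order = sorted(words, key=str.casefold)
bin_order = sorted(words, key=lambda w: w.encode("gbk"))
print(ci_order)
print(bin_order)
```

Under the case-folding key the upper and lower case forms of a letter sort next to each other, while under the byte key all upper case letters come before all lower case ones.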

Origin blog.csdn.net/m0_37449634/article/details/135554392