MySQL character set and its collation

Basic introduction to utf8mb4


Basic Features


  • utf8mb4 is a MySQL character set that stores Unicode text in UTF-8 encoding, using up to 4 bytes per character.

  • Unicode covers almost every character in use, including the letters and symbols of the world's languages as well as emoji.


Differences from utf8mb3


Version


  • The utf8mb4 character set is supported in MySQL 5.5.3 and later.
  • Before that version, the only UTF-8 character set MySQL offered was utf8, i.e. utf8mb3.

Encoding


  • In MySQL, the utf8 character set only supports UTF-8 sequences of up to 3 bytes. It therefore cannot store characters outside the Basic Multilingual Plane, such as most emoji and other supplementary characters.
  • To remove this limitation, MySQL introduced the utf8mb4 character set. It supports UTF-8 sequences of up to 4 bytes and can therefore represent the full range of Unicode characters, including emoji.
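
The byte-width difference can be checked from any SQL prompt. This is a minimal sketch, assuming a server that supports utf8mb4 (MySQL 5.5.3 or later); the two hex literals are the UTF-8 encodings of U+4E2D (中) and U+1F600 (😀), chosen here purely for illustration.

```sql
-- A character inside the Basic Multilingual Plane: 3 bytes in UTF-8,
-- so it fits in both utf8 (utf8mb3) and utf8mb4.
SELECT LENGTH(_utf8mb4 X'E4B8AD')      AS bytes,   -- 3
       CHAR_LENGTH(_utf8mb4 X'E4B8AD') AS chars;   -- 1

-- An emoji outside the Basic Multilingual Plane: 4 bytes in UTF-8,
-- so only utf8mb4 can hold it.
SELECT LENGTH(_utf8mb4 X'F09F9880')      AS bytes, -- 4
       CHAR_LENGTH(_utf8mb4 X'F09F9880') AS chars; -- 1
```

Writing the second value into a utf8 (utf8mb3) column typically fails with an "Incorrect string value" error when strict SQL mode is enabled.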

utf8mb4 collation


Common collations


  • utf8mb4_general_ci:
    • The default collation for utf8mb4 before MySQL 8.0. It is case-insensitive and uses simplified comparison rules intended to work across languages.
    • Under this rule, 'a' and 'A' are considered equal.
  • utf8mb4_unicode_ci:
    • Based on the Unicode Collation Algorithm (UCA); case-insensitive.
    • Compared with utf8mb4_general_ci, utf8mb4_unicode_ci is more precise and sorts the characters of various languages correctly.
  • utf8mb4_bin:
    • A binary collation: it is case-sensitive and sorts by the binary value of each character.
    • Under this rule, 'A' comes before 'a'.
  • utf8mb4_0900_ai_ci:
    • A collation for the utf8mb4 character set introduced in MySQL 8.0.
    • Before MySQL 8.0, utf8mb4 used utf8mb4_general_ci as its default collation. That collation is not accurate enough for some character comparisons, so sorting and comparison results can be unexpected.
    • Based on UCA 9.0.0; it is accent-insensitive (ai) and case-insensitive (ci), and it sorts and compares characters more accurately.

In addition to the common collations above, MySQL provides others, such as utf8mb4_unicode_520_ci and, in MySQL 8.0, utf8mb4_0900_as_cs and utf8mb4_0900_bin. Choose among them according to specific needs.
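
The behaviour of these collations can be tried out without creating any tables. This is a small sketch; the `_utf8mb4` introducers are used so that the result does not depend on the connection character set, and the column aliases are illustrative only.

```sql
-- List every collation this server offers for utf8mb4.
SHOW COLLATION WHERE Charset = 'utf8mb4';

-- Case-insensitive collation: 'a' and 'A' compare equal.
SELECT _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_general_ci AS equal_ci;   -- 1

-- Binary collation: compared by code value, so the two differ
-- and 'A' (0x41) sorts before 'a' (0x61).
SELECT _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_bin AS equal_bin,         -- 0
       _utf8mb4'A' < _utf8mb4'a' COLLATE utf8mb4_bin AS upper_first;       -- 1
```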


Default collation


When a table's default character set is utf8mb4 and no collation is specified explicitly:

  • In MySQL 5.7, the default collation is utf8mb4_general_ci.
  • In MySQL 8.0, the default collation is utf8mb4_0900_ai_ci.
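
To see which collation a given server actually picks, one option is to create a throwaway table and inspect it. The table names below are hypothetical, and the query assumes a default database has been selected.

```sql
-- Hypothetical table: only the character set is given, not the collation.
CREATE TABLE demo_default_collation (
  name VARCHAR(50)
) DEFAULT CHARACTER SET utf8mb4;

-- Shows the collation the server chose for the table.
SELECT TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'demo_default_collation';

-- Naming the collation explicitly avoids version-dependent defaults.
CREATE TABLE demo_pinned_collation (
  name VARCHAR(50)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```

With default settings, the query should report utf8mb4_general_ci on MySQL 5.7 and utf8mb4_0900_ai_ci on MySQL 8.0.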

Compatibility issues


Because utf8mb4_0900_ai_ci was introduced in MySQL 8.0, importing a table exported from MySQL 8.0 into MySQL 5.7 or MySQL 5.6 fails: the older server does not recognize the collation.

  • [Err] 1273 - Unknown collation: 'utf8mb4_0900_ai_ci'
    
  • Solution: change the collation of the newly created database, or manually replace every occurrence of the collation in the SQL file.
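
As an alternative to editing the dump by hand, the collation can be changed on the MySQL 8.0 side before exporting. This is only a sketch of that variant, not the procedure described above; `demo_db` and `demo_table` are placeholder names, and `CONVERT TO` rebuilds the table and changes comparison behaviour, so it is best tried on a copy first.

```sql
-- Run on the MySQL 8.0 source before exporting. All names are placeholders.
ALTER DATABASE demo_db CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

-- Converts the table default and every character column in the table.
ALTER TABLE demo_db.demo_table
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

-- List tables in the schema that still use the 8.0-only collation.
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'demo_db'
  AND TABLE_COLLATION = 'utf8mb4_0900_ai_ci';
```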


Comparison of utf8mb4_unicode_ci and utf8mb4_general_ci

  • Accuracy:
    • utf8mb4_unicode_ci sorts and compares according to the Unicode standard (UCA), handles special characters, and sorts accurately across languages.
    • utf8mb4_general_ci does not implement the full Unicode standard and cannot handle some special characters correctly.
  • Performance:
    • utf8mb4_general_ci is somewhat faster at sorting.
    • utf8mb4_unicode_ci implements more complex rules for special characters, so it is slightly slower.
    • In most scenarios, there is no significant performance difference between the two.
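
A well-known case where the two collations disagree is the German sharp s, as described in the MySQL manual: utf8mb4_unicode_ci treats it as equal to 'ss', while utf8mb4_general_ci treats it as equal to a single 's'. The sketch below uses a hex literal for the character so that the result does not depend on the client encoding; the aliases are illustrative.

```sql
-- X'C39F' is the UTF-8 encoding of the German sharp s (ß).
SELECT _utf8mb4 X'C39F' = _utf8mb4'ss' COLLATE utf8mb4_unicode_ci AS unicode_ss,  -- 1
       _utf8mb4 X'C39F' = _utf8mb4'ss' COLLATE utf8mb4_general_ci AS general_ss,  -- 0
       _utf8mb4 X'C39F' = _utf8mb4's'  COLLATE utf8mb4_general_ci AS general_s;   -- 1
```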

Server-level collation parameters


collation_server


  • collation_server is a system variable that specifies the default collation at the server level.
  • It supplies the default collation for new databases that do not specify one; new tables in turn inherit the default of their database.

To view the value of collation_server on the current MySQL server:

SHOW VARIABLES LIKE 'collation_server';

The command returns a result set containing the variable name collation_server and its current value.

Note:

  • collation_server is a server-level variable whose initial value is set when the MySQL server starts.
  • It is usually set in a configuration file (such as my.cnf or my.ini); a change made there takes effect after the MySQL server is restarted.
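
Besides the configuration file, collation_server is a dynamic variable, so it can also be inspected and changed from a SQL session. This is a minimal sketch; the collation value is only an example, and the statements need privileges sufficient to set global system variables.

```sql
-- Current global and session values.
SELECT @@global.collation_server, @@session.collation_server;

-- Runtime change for new connections; lost when the server restarts.
SET GLOBAL collation_server = 'utf8mb4_unicode_ci';

-- MySQL 8.0 only: also persist the value so that it survives a restart.
SET PERSIST collation_server = 'utf8mb4_unicode_ci';
```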

Default parameter rules


  • collation_database holds the collation of the default (currently selected) database. When no default database is selected, it takes the same value as collation_server.
  • If no collation is specified when creating a database, the server-level default (collation_server) is used; if none is specified when creating a table, the collation of its database is used.
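
A quick way to see which defaults are in effect is to compare the variables and then inspect a newly created database. The database name `demo_inherit` is hypothetical.

```sql
-- With no default database selected, the two values normally match.
SELECT @@collation_server, @@collation_database;

-- Hypothetical database created without CHARACTER SET or COLLATE.
CREATE DATABASE demo_inherit;

-- Shows the character set and collation the new database received.
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'demo_inherit';

-- After USE, collation_database reflects the selected database.
USE demo_inherit;
SELECT @@collation_database;
```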

Note:

  • The global variables character_set_database and collation_database were deprecated in MySQL 5.7, as was assigning values to their session counterparts; they are expected to be removed in a future release.
  • MySQL 8.0 adds the parameter default_collation_for_utf8mb4, which controls the default collation of the utf8mb4 character set. Its value must be utf8mb4_0900_ai_ci or utf8mb4_general_ci. It is primarily intended to support replication from older servers.
  • The parameter default_collation_for_utf8mb4 takes effect in the following cases:
    • When running the SHOW COLLATION and SHOW CHARACTER SET commands.
    • When creating or altering a database with utf8mb4 specified but no collation.
    • When creating or altering a table with utf8mb4 specified but no collation.
    • When adding or modifying a column with utf8mb4 specified but no collation.
    • In other cases where utf8mb4 is used without an explicit collation, such as a string literal with the _utf8mb4 introducer.
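
The effect of this parameter can be observed in a single session. This is a hedged sketch for MySQL 8.0.11 or later: the table name is hypothetical, and setting the session value may require privileges for restricted session variables.

```sql
-- Switch this session's default utf8mb4 collation to the pre-8.0 one.
SET @@session.default_collation_for_utf8mb4 = 'utf8mb4_general_ci';

-- Hypothetical table: utf8mb4 is named, but no collation is given.
CREATE TABLE demo_utf8mb4_default (
  name VARCHAR(50)
) CHARACTER SET utf8mb4;

-- Inspect the collation the table received.
SELECT TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'demo_utf8mb4_default';
```

If the parameter behaves as described above, the query is expected to report utf8mb4_general_ci rather than the MySQL 8.0 default.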

Origin blog.csdn.net/LYS00Q/article/details/131512467