The difference between utf8 and utf8mb4 in MySQL

1. Introduction

MySQL added the utf8mb4 character set in version 5.5.3. The name mb4 stands for "most bytes 4": it was designed to handle Unicode characters that need four bytes. utf8mb4 is a superset of utf8, so apart from changing the character set to utf8mb4, no other data conversion is needed. Of course, if you do not need four-byte characters, utf8 is usually enough and saves some space.

2. Content description

Since utf8 can already store most Chinese characters, why use utf8mb4? The original utf8 character set in MySQL stores at most 3 bytes per character. If you try to insert a character that needs 4 bytes, the insert fails with an error (or, in non-strict SQL mode, the value is truncated with a warning).

The largest Unicode code point that three-byte UTF-8 can encode is 0xFFFF, which is the end of the Basic Multilingual Plane (BMP). In other words, any Unicode character outside the BMP cannot be stored in MySQL's utf8 character set. This includes emoji (a special group of Unicode characters, common on iOS and Android phones), many uncommon Chinese characters, and any newly added Unicode characters. This is the main shortcoming of utf8.
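To make the BMP boundary concrete, here is a minimal Python sketch that lists the characters in a string whose code point is above 0xFFFF. These are the characters that need four bytes in UTF-8 and therefore do not fit in MySQL's utf8. The function name needs_utf8mb4 and the sample string are made up for illustration; they are not part of MySQL or any library.

```python
def needs_utf8mb4(text):
    """Return the characters of text that lie outside the BMP (code point > 0xFFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Hypothetical sample: ASCII, two common Chinese characters,
# one emoji (U+1F600) and one rare Chinese character (U+20000).
sample = "MySQL \u4e2d\u6587 \U0001F600 \U00020000"

for ch in needs_utf8mb4(sample):
    # Each of these characters encodes to 4 bytes in UTF-8.
    print(f"U+{ord(ch):04X} takes {len(ch.encode('utf-8'))} bytes in UTF-8")
```

A check like this could be run on application input before writing to a utf8 column, to see whether a switch to utf8mb4 is needed.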

Generally, when a computer stores characters, the amount of space it uses depends on the character and on the encoding. For example:

  1. In ASCII, an English letter (upper or lower case) occupies one byte. ASCII itself cannot represent Chinese characters; in ASCII-compatible legacy encodings such as GBK, a Chinese character occupies two bytes. A byte is 8 binary digits, so read as an unsigned decimal number its minimum value is 0 and its maximum value is 255.
  2. In UTF-8, an English letter occupies one byte, and a common Chinese character (simplified or traditional) occupies three bytes.
  3. In the fixed-width two-byte form often loosely called "Unicode encoding" (UCS-2), an English letter occupies two bytes, and a Chinese character (simplified or traditional) also occupies two bytes.
  4. In UTF-16, an English letter or a common Chinese character occupies 2 bytes; some Chinese characters in the Unicode extension areas require 4 bytes.
  5. In UTF-32, every character occupies 4 bytes.
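The byte counts in the list above can be checked with a short Python sketch. It uses Python's built-in codecs, with the big-endian variants of UTF-16 and UTF-32 so that no byte order mark is added to the count. The names ENCODINGS, SAMPLES and byte_lengths are made up for this example.

```python
ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]

# Hypothetical sample characters, one per category discussed above.
SAMPLES = {
    "English letter": "A",
    "common Chinese character": "\u4e2d",
    "emoji (outside the BMP)": "\U0001F600",
}


def byte_lengths(ch):
    """Map each encoding name to the number of bytes ch occupies in it."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for label, ch in SAMPLES.items():
    print(label, byte_lengths(ch))
```

The emoji takes 4 bytes in all three encodings, which is why it does not fit in a character set limited to 3 bytes per character.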

Since utf8 already covers most characters, why was utf8mb4 added?

With the development of the Internet, many new kinds of characters have appeared, such as emoji, the little yellow faces we usually send when chatting. These characters are not in the Basic Multilingual Plane, so they cannot be stored in MySQL's utf8. MySQL therefore extended utf8 and added the utf8mb4 character set.

Therefore, if you want to let users enter special symbols such as emoji, it is best to use utf8mb4 when designing the database, so that the database has better compatibility. The trade-off is that utf8mb4 allows up to 4 bytes per character instead of 3, so it can consume more storage space.
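The storage trade-off can be estimated with simple arithmetic. The sketch below is only an illustration: max_bytes is a made-up helper, the 255-character length is an arbitrary example, and Python's "utf-8" codec stands in for the bytes the text would occupy. It shows that the worst-case size grows from 3 to 4 bytes per character, while text made up only of BMP characters encodes to the same number of bytes either way.

```python
def max_bytes(char_count, max_bytes_per_char):
    """Upper bound on the encoded size of char_count characters."""
    return char_count * max_bytes_per_char


# Worst case for 255 characters (an arbitrary example length).
print(max_bytes(255, 3))  # utf8: 765
print(max_bytes(255, 4))  # utf8mb4: 1020

# Text with only BMP characters: 6 one-byte characters plus 2 three-byte ones.
bmp_text = "hello \u4e2d\u6587"
print(len(bmp_text.encode("utf-8")))  # 12

# Adding one emoji costs 4 more bytes, and only utf8mb4 can store it.
print(len((bmp_text + "\U0001F600").encode("utf-8")))  # 16
```

So the extra cost shows up mainly in worst-case size limits and in the 4-byte characters themselves, not in ordinary text.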

Origin blog.csdn.net/qq_37823979/article/details/107634177