Unicode and UTF-8 encoding

The relationship between Unicode and UTF-8:

  • Unicode is a character set that assigns a unique number (a code point) to almost every known character in the world.
  • Unicode itself does not specify how those code points are stored as bytes. That is the job of an encoding.
  • UTF-8 (8-bit Unicode Transformation Format) is one such encoding, alongside UTF-16 and UTF-32.
  • UTF-8 uses 1 to 4 bytes per character, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes.
  • UTF-8 is variable-length: the number of bytes depends on the character being encoded. The drawback of UTF-32 is obvious: for ASCII characters such as English letters, it takes four times the space that UTF-8 does.
  • UTF-8 is currently the most widely used character encoding.
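
The byte counts above can be checked with a short Python sketch. The sample characters and the helper name encoded_lengths are chosen here for illustration and are not from the original article. The little-endian codec names are used so that no byte order mark is added to the result.

```python
# Compare how many bytes one character takes in UTF-8, UTF-16 and UTF-32.
# "utf-16-le" and "utf-32-le" are used so that Python does not prepend a BOM.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(ch: str) -> tuple:
    """Return the encoded length in bytes of ch for each encoding in ENCODINGS."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)


# Illustrative samples: an ASCII letter, an accented Latin letter,
# a common Chinese character, and an emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    utf8, utf16, utf32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}  utf-8={utf8}  utf-16={utf16}  utf-32={utf32}")
```

The ASCII letter takes 1 byte in UTF-8 but 4 in UTF-32, while the emoji takes 4 bytes in all three encodings.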

Note:

MySQL has two character sets that implement UTF-8:

  • utf8 (an alias for utf8mb3): supports only 1 to 3 bytes per character. Digits, English letters, and basic symbols occupy one byte, and most Chinese characters occupy three bytes. Emoji, however, need 4 bytes, as do some rarer and more complex characters (including some rare traditional Chinese characters), so they cannot be stored in utf8.
  • utf8mb4: a complete implementation of UTF-8 that supports up to four bytes per character.
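
As a rough illustration of the difference, the sketch below checks whether a string contains any character that needs 4 bytes in UTF-8, that is, any code point above U+FFFF. The function name needs_utf8mb4 is hypothetical and is not part of MySQL or of any library. It only looks at the text in Python and does not talk to a database.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if text contains a character that takes 4 bytes in UTF-8.

    Code points above U+FFFF encode to 4 bytes in UTF-8, which is more than
    MySQL's 3-byte utf8 character set can hold.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


# Illustrative inputs: plain text, an emoji, and a rare CJK character (U+20000).
for label, text in [
    ("ascii and common Chinese", "hello \u4e2d\u6587"),
    ("emoji", "ok \U0001F600"),
    ("rare CJK character", "\U00020000"),
]:
    print(label, needs_utf8mb4(text))
```

A check like this could be run on input before inserting it into a column that still uses the 3-byte character set.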

Origin blog.csdn.net/qq_45800977/article/details/130361441