The relationship between Unicode and UTF-8:
- The Unicode character set includes almost all known characters in the world.
- But the Unicode character set itself only assigns each character a number (a code point); it does not say how those code points are stored as bytes.
- That is the job of an encoding such as UTF-8 (8-bit Unicode Transformation Format), or the similar UTF-16 and UTF-32.
- UTF-8 uses 1-4 bytes to encode each character, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes.
- UTF-8 is a variable-length encoding: the number of bytes depends on the character being encoded. The drawback of UTF-32 is obvious: for ASCII characters such as English letters, it takes four times the space of UTF-8.
- UTF-8 is currently the most widely used character encoding.
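The byte counts above can be checked with a small Python sketch. The helper name `encoded_lengths` and the sample characters are chosen here only for illustration; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
# Sample characters: ASCII letter, accented Latin letter, Chinese character, emoji.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def encoded_lengths(ch: str) -> dict:
    """Return the byte length of `ch` in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for ch in SAMPLES:
        # Print the code point followed by the size in each encoding.
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

For the ASCII letter this reports 1, 2, and 4 bytes, while the emoji needs 4 bytes in all three encodings.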
Note:
MySQL has two UTF-8 character sets:
- utf8 (an alias for utf8mb3): stores at most 3 bytes per character. Commonly used Chinese characters take three bytes, while ASCII digits, English letters, and symbols take one byte. Emoji need 4 bytes in UTF-8, as do some rare characters (for example, rarely used Chinese characters outside the Basic Multilingual Plane), so they cannot be stored in this character set.
- utf8mb4: a complete implementation of UTF-8 that supports up to four bytes per character.
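As a rough illustration of the three-byte limit, the following Python sketch checks whether a string would fit in MySQL's three-byte UTF-8 character set. The function name `fits_in_three_byte_utf8` is hypothetical. The check relies on the fact that UTF-8 needs four bytes exactly for code points above U+FFFF. It does not talk to MySQL at all, so it is only a pre-check, not a statement about how the server behaves.

```python
def fits_in_three_byte_utf8(text: str) -> bool:
    """Return True if every character of `text` encodes to at most 3 UTF-8 bytes.

    Code points above U+FFFF (emoji, rare CJK characters, ...) need 4 bytes.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    print(fits_in_three_byte_utf8("hello \u4e2d\u6587"))  # common Chinese text
    print(fits_in_three_byte_utf8("hi \U0001F600"))        # contains an emoji
```

A string that fails this check would need a utf8mb4 column to be stored without loss.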