A major MySQL pitfall: never use "utf8" in MySQL

I recently ran into a bug: I tried to save a UTF-8 string through Rails into a MariaDB database that used the "utf8" encoding, and got a bizarre error:

Incorrect string value: ‘😃 <…’ for column ‘summary’ at row 1

My client used UTF-8, the server used UTF-8, the database used UTF-8, and even the string I was trying to save, "😃 <…", was valid UTF-8.

The crux of the problem is that MySQL's "utf8" is not actually real UTF-8.

MySQL's "utf8" supports at most three bytes per character, while real UTF-8 uses up to four bytes per character.

MySQL's developers never fixed this bug. Instead, in 2010 they released a workaround: a new character set called "utf8mb4".

Of course, they never advertised the new character set widely (possibly because the bug embarrassed them), so advice all over the web still tells developers to use "utf8". That advice is wrong.

In short:

1. MySQL's "utf8mb4" is real UTF-8.

2. MySQL's "utf8" is a proprietary character encoding that can represent only some of the Unicode characters.

Let me be clear: every MySQL and MariaDB user who is currently using "utf8" should switch to "utf8mb4" and never use "utf8" again.
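
The three-byte limit is easy to check outside the database. The following is a minimal Python sketch: it counts how many bytes real UTF-8 needs for a few sample characters and tests whether a string would fit within a three-byte-per-character limit. The function names are made up for this illustration, and the check only models the limit described above; it does not talk to MySQL.

```python
def utf8_length(char):
    """Number of bytes needed to encode one character in real UTF-8."""
    return len(char.encode("utf-8"))


def fits_in_mysql_utf8(text):
    """True if every character needs at most three bytes in UTF-8.

    Assumption: this only models the three-byte limit of MySQL's "utf8";
    it is not a call into MySQL or MariaDB.
    """
    return all(utf8_length(ch) <= 3 for ch in text)


# Print the byte count for a few sample characters.
for ch in ["C", "é", "€", "😃"]:
    print(repr(ch), utf8_length(ch))

# Check the kind of string that triggered the error.
print(fits_in_mysql_utf8("😃 <"))
```

Any character that needs four bytes, which includes all emoji, falls outside what "utf8" can store.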

So what is an encoding? What is UTF-8?

We all know that computers store text as 0s and 1s. For example, the character "C" is stored as "01000011". To display this character, the computer goes through two steps:

1. The computer reads "01000011" and works out that it is the number 67, because 67 is encoded as "01000011".

2. The computer looks up 67 in the Unicode character set and finds "C".

The same happens in reverse:

1. My computer maps "C" to 67 in the Unicode character set.

2. My computer encodes 67 as "01000011" and sends it to the web server.
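
Both directions can be tried out in a few lines of Python. This is only an illustrative sketch; the helper names `to_bits` and `from_bits` are invented here.

```python
def to_bits(char):
    """Encode a character as UTF-8 and show the result as a string of 0s and 1s."""
    return "".join(format(byte, "08b") for byte in char.encode("utf-8"))


def from_bits(bits):
    """Turn a string of 0s and 1s back into text by decoding it as UTF-8."""
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return raw.decode("utf-8")


print(ord("C"))               # the Unicode number for "C"
print(to_bits("C"))           # the bits sent over the wire
print(from_bits("01000011"))  # and back to the character
```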

Almost all web applications use the Unicode character set, because there is no reason to use any other.

The Unicode character set has room for over a million characters. The simplest encoding is UTF-32, which uses 32 bits for every character. It is the simplest because computers have long treated groups of 32 bits as numbers, and handling numbers is what computers do best. The problem is that it wastes a lot of space.

UTF-8 saves space. In UTF-8, the character "C" needs only 8 bits, while some less common characters, such as "😃", need 32 bits. Other characters take 16 or 24 bits. An article like this one takes only about a quarter of the space in UTF-8 that it would take in UTF-32.
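
The space saving can be measured directly. The sketch below compares the size of the same text in UTF-8 and UTF-32; the sample sentence is made up, and for plain English text like it the ratio comes out at one quarter.

```python
def encoded_sizes(text):
    """Return (bytes in UTF-8, bytes in UTF-32) for the given text."""
    # "utf-32-le" is used so that no byte-order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


sample = "An article like this one is mostly plain English text."
utf8_size, utf32_size = encoded_sizes(sample)
print(utf8_size, utf32_size, utf8_size / utf32_size)
```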

MySQL's "utf8" character set is not compatible with other programs: when they hand it a character such as "😃", it fails.

A Brief History of MySQL

Why did MySQL's developers break "utf8"? We may be able to find the answer in the commit logs.

MySQL has supported UTF-8 since version 4.1, which dates from 2003. Today's UTF-8 standard (RFC 3629) came after that.

The older UTF-8 standard (RFC 2279) allowed up to six bytes per character. On March 28, 2002, MySQL's developers used RFC 2279 in the first preview release of MySQL 4.1.

Then, in September, they made a change to the MySQL source code: "UTF8 now only supports up to 3-byte sequences."

Who committed this change, and why? Nobody knows. When MySQL moved to Git (it originally used BitKeeper), many committer names in the code base were lost, and nothing on the mailing list from around September 2003 explains the change.

But I can make a guess.

Back in 2002, MySQL made a design decision: if users could guarantee that every row in a table had the same number of bytes, MySQL could give them a big performance boost. To get it, users defined their text columns as "CHAR". A "CHAR" column always holds the same number of characters: if you insert fewer characters than the defined length, MySQL pads the value with trailing spaces, and if you insert more, the excess is truncated.

When MySQL's developers first tried UTF-8, they reserved 6 bytes for every character: CHAR(1) used 6 bytes, CHAR(2) used 12 bytes, and so on.
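
The arithmetic behind fixed-width columns can be sketched as follows. This is a simplified Python model of the behavior described above, not MySQL's actual storage code, and both function names are invented for the illustration.

```python
def store_char(value, length):
    """Model a CHAR(length) column: truncate long values, pad short ones with spaces."""
    return value[:length].ljust(length)


def reserved_bytes(length, max_bytes_per_char):
    """Bytes a fixed-width CHAR(length) column must reserve for every row."""
    return length * max_bytes_per_char


print(repr(store_char("ab", 4)))      # padded with trailing spaces
print(repr(store_char("abcdef", 4)))  # excess characters cut off

# Compare the row width under a 6-byte, 4-byte and 3-byte maximum.
for max_bytes in (6, 4, 3):
    print(max_bytes, reserved_bytes(1, max_bytes), reserved_bytes(2, max_bytes))
```

Under this model, lowering the maximum from six bytes to three halves the size of every CHAR column, which is consistent with the guess about the motive for the change.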

That initial behavior was correct, but that version was never released. It was documented, though, and the documentation circulated widely; everyone who understood UTF-8 agreed with what it said.

But clearly some MySQL developer or vendor was worried about users who did two things:

1. Defined columns as CHAR (today CHAR is a relic, but back then CHAR columns were faster in MySQL; that has not been true since 2005).

2. Set the encoding of those CHAR columns to "utf8".

My guess is that MySQL's developers wanted to help users who were trying to win on both space and speed, and broke the "utf8" encoding in the process.

The result is that nobody won. Users who wanted space and speed found that their "utf8" CHAR columns still took more space and ran slower than they expected. Users who wanted correctness found that "utf8" could not store characters such as "😃".

Once this invalid character set had been released, MySQL could not fix it, because a fix would have forced every user to rebuild their databases. Finally, in 2010, MySQL released "utf8mb4" to support real UTF-8.

Why this is so maddening

This problem drove me crazy for a whole week. I was fooled by the name "utf8" and spent a lot of time tracking down the bug. And I am certainly not the only one: almost every article on the web treats "utf8" as if it were real UTF-8.

"utf8" can only be regarded as a proprietary character set. It created a new problem, and that problem has still not been fixed.

To sum up

If you are using MySQL or MariaDB, do not use the "utf8" encoding; use "utf8mb4" instead. This guide ( https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4 ) explains how to convert an existing database from "utf8" to "utf8mb4".
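
As a rough idea of what such a conversion involves, the Python sketch below only builds the SQL statements; it does not connect to or change any database. The database name, the table names and the choice of the "utf8mb4_unicode_ci" collation are assumptions made for this example. A real migration also needs a backup and a check of column and index lengths first.

```python
def conversion_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build the SQL statements that switch a database and its tables to utf8mb4.

    The names are inserted as-is, so pass only trusted identifiers.
    """
    statements = [
        f"ALTER DATABASE `{database}` CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Hypothetical database and table names, for illustration only.
for statement in conversion_statements("blog", ["posts", "comments"]):
    print(statement)
```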

Origin blog.51cto.com/13954634/2402553