UTF-8 encoding principles and a whitelist filter for utf8mb4 characters (Caused by: java.sql.BatchUpdateException: Incorrect string value)

Recently I ran into the following encoding errors when writing data to MySQL:

Caused by: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\x98\x8A',...' for column 'statement_text' at row 1

Caused by: java.sql.BatchUpdateException: Incorrect string value: '\xF0\xA0\x81\x81%'...' for column 'statement_sample' at row 1

Most of the solutions found online modify the database configuration. However, when the database is accessed through a connection pool, you cannot guarantee how the character set is configured on the other connections, so changing it on one connection risks contaminating the rest of the pool. Here is another solution: filter out the special characters.

1 UTF-8 encoding ranges

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, and also a prefix code. It can represent any character in the Unicode standard, and its single-byte encodings remain compatible with ASCII, so software written to handle ASCII text can continue to be used with little or no modification. It has therefore gradually become the preferred encoding for e-mail, web pages and other applications that store or transmit text.

1.0 How to look up a character

http://www.fileformat.info/info/unicode/char/xxxxx/index.htm

Replace xxxxx with the hexadecimal code point of the character you want to look up.

For example, the smiling face emoji has code point 1f60a, and the character a has code point 61.

1.1 ASCII

The 128 US-ASCII characters need only one byte each (Unicode range U+0000 to U+007F).

For example:

Hex (Java)    Character
"\u0060"      `
"\u0061"      a
"\u0062"      b
"\u0063"      c
"\u0064"      d
"\u0065"      e

1.2 Latin, etc.

Latin letters with diacritics, as well as Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana characters, need two bytes (Unicode range U+0080 to U+07FF).

Hex (Java)    Character
"\u0550"      Ր
"\u0450"      ѐ

1.3 Chinese, etc.

The remaining characters of the Basic Multilingual Plane (BMP), which covers most characters in common use, including most Chinese characters, need three bytes (Unicode range U+0800 to U+FFFF).

Hex (Java)    Character
"\u9AD8"      高
"\u738B"      王

1.4 Other

The rarely used characters in the supplementary planes need four bytes. The original UTF-8 design allowed four to six bytes (U+10000 to U+1FFFFF with four bytes, U+200000 to U+3FFFFFF with five bytes, U+4000000 to U+7FFFFFFF with six bytes), but since RFC 3629 UTF-8 is limited to code points up to U+10FFFF, so in practice no character takes more than four bytes.
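The byte counts given in sections 1.1 to 1.4 can be checked directly in Java. The following is only a minimal sketch: the class name Utf8Lengths and the helper utf8Length are made up for illustration, and the sample characters are taken from the tables in this section.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes the given string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("\u0061"));       // U+0061 (ASCII): 1
        System.out.println(utf8Length("\u0450"));       // U+0450 (Cyrillic): 2
        System.out.println(utf8Length("\u9AD8"));       // U+9AD8 (Chinese): 3
        System.out.println(utf8Length("\uD83D\uDE0A")); // U+1F60A (emoji): 4
    }
}
```

Note that the emoji is a single code point but two Java chars (a surrogate pair), which is why it is written with two escapes.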

Hex (Java)        Character
"\uD83D\uDE0A"    😊
"\uD83D\uDE0F"    😏

2 Meaning of each byte in UTF-8

  • For any byte B in UTF-8 encoded data, if the first bit of B is 0, then B on its own represents one character (an ASCII character);
  • If the first bit of B is 1 and the second bit is 0, then B is a continuation byte of a multi-byte (non-ASCII) character;
  • If the first two bits of B are 1 and the third bit is 0, then B is the first byte of a two-byte character;
  • If the first three bits of B are 1 and the fourth bit is 0, then B is the first byte of a three-byte character;
  • If the first four bits of B are 1 and the fifth bit is 0, then B is the first byte of a four-byte character.

Thus, for any byte in UTF-8 encoded data, the first bit tells whether it is an ASCII character; the first two bits tell whether it is a continuation byte of a multi-byte character; the first four bits (when the first two bits are 1) tell that it is the first byte of a character and how many bytes that character occupies; and the first five bits (when the first four bits are 1) tell whether the data is malformed or was corrupted during transmission.

Code point bits  First code point  Last code point  Bytes  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7                U+0000            U+007F           1      0xxxxxxx
11               U+0080            U+07FF           2      110xxxxx  10xxxxxx
16               U+0800            U+FFFF           3      1110xxxx  10xxxxxx  10xxxxxx
21               U+10000           U+1FFFFF         4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26               U+200000          U+3FFFFFF        5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31               U+4000000         U+7FFFFFFF       6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

  • A character within the ASCII range takes one byte, and a character beyond it takes several bytes, which gives the representation shown above. The benefit of UTF-8 is that a Unicode file containing only ASCII characters is stored with one byte per character, so it is no different from an ordinary ASCII file and reads the same way. This keeps it compatible with existing ASCII files.
  • For a character beyond the ASCII range, the leading bits of the first byte give the length of the character: 110xxxxx tells us it is a 2-byte character, 1110xxxx a 3-byte character, and so on. The x positions are filled with the bits of the character's code point in binary, and the further right an x is, the less significant the bit. A character is encoded with the shortest multi-byte sequence that can hold it. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the total number of bytes in the sequence.
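The rules above can be sketched in Java as follows. This is an illustrative sketch rather than a complete decoder: the class Utf8LeadByte and its methods are hypothetical names, only sequences of one to four bytes are handled, and continuation bytes are not validated. The sample bytes are the ones reported in the first error message at the top of this article.

```java
public class Utf8LeadByte {
    // Returns the sequence length announced by a first byte, 0 for a
    // continuation byte (10xxxxxx), or -1 for a byte that cannot start
    // a sequence of one to four bytes.
    static int sequenceLength(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return 0; // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    // Decodes the single character that starts at bytes[0].
    static int decodeFirst(byte[] bytes) {
        int len = sequenceLength(bytes[0]);
        if (len < 1) throw new IllegalArgumentException("not a first byte");
        if (len == 1) return bytes[0];
        // Keep only the x bits of the first byte (5, 4 or 3 of them).
        int codePoint = bytes[0] & (0xFF >> (len + 1));
        for (int i = 1; i < len; i++) {
            // Each continuation byte contributes its low 6 bits.
            codePoint = (codePoint << 6) | (bytes[i] & 0x3F);
        }
        return codePoint;
    }

    public static void main(String[] args) {
        byte[] fromError = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x8A};
        System.out.println(sequenceLength(fromError[0]));                // 4
        System.out.println(Integer.toHexString(decodeFirst(fromError))); // 1f60a
    }
}
```

Decoding the bytes from the error message this way shows that the rejected value was U+1F60A, the smiling face emoji from section 1.4.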

3 Filtering 4-byte UTF-8 characters in Java (keeping characters of up to 3 bytes)

  • As described in 1.2 and 1.3, the characters encoded in up to three bytes cover essentially all characters in regular use. A whitelist that keeps only these characters meets general business needs, and filtering out the rest solves the problem of special characters that cannot be inserted into MySQL (MySQL's utf8 character set stores at most three bytes per character, which is why 4-byte characters are rejected).

  • A 4-byte UTF-8 character is a Unicode character in the supplementary planes, that is, one whose code point is greater than U+FFFF. So we only need to read each code point of the string and filter it out when it is greater than 0xFFFF (this can also be checked directly with Character.isSupplementaryCodePoint).

Sample code is as follows:

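    @Test
    public void filterUtf8mb4Test() {
        String s = "a中\uD83D\uDD11a中";
        log.info(filterUtf8mb4(s));
    }

    // Keeps BMP characters (at most 3 bytes in UTF-8) and drops
    // supplementary characters (4 bytes in UTF-8).
    public static String filterUtf8mb4(String str) {
        final int LAST_BMP = 0xFFFF;
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            int codePoint = str.codePointAt(i);
            // U+FFFF itself still belongs to the BMP, hence <= rather than <.
            if (codePoint <= LAST_BMP) {
                sb.appendCodePoint(codePoint);
            } else {
                // A supplementary code point occupies two chars (a surrogate
                // pair), so skip the second one as well.
                i++;
            }
        }
        return sb.toString();
    }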

The output is:

a中a中
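The same filter can also be written with Character.isSupplementaryCodePoint, as mentioned above. The following is a sketch of that variant, not the code used in the original test: the class BmpFilter and the method filterSupplementary are hypothetical names, and String.codePoints() requires Java 8 or later.

```java
public class BmpFilter {
    // Drops every supplementary code point (above U+FFFF) and keeps the rest.
    static String filterSupplementary(String str) {
        StringBuilder sb = new StringBuilder(str.length());
        str.codePoints()
           .filter(cp -> !Character.isSupplementaryCodePoint(cp))
           .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same input as the test above, written with escapes only.
        String s = "a\u4E2D\uD83D\uDD11a\u4E2D";
        System.out.println(filterSupplementary(s)); // prints the input without the emoji
    }
}
```

Because codePoints() already yields whole code points, this version needs no manual index adjustment for surrogate pairs.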

4 Some notes

If you do not want to restart the program

Run the following in the current session before executing further statements:

set character_set_client = utf8mb4;

SET NAMES utf8mb4;

Server configuration changes

(These settings apply globally; restart the server after changing them.)

[client] 
default-character-set = utf8mb4
[mysqld]
character-set-server = utf8mb4 
collation-server = utf8mb4_unicode_ci
[mysql] 
default-character-set = utf8mb4

For the meaning of these MySQL options, parameters and character sets, refer to the official MySQL documentation.

SQL to modify the character set

ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(length) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

5 References

Character set data sources:
https://zh.wikipedia.org/wiki/UTF-8
http://www.fileformat.info
https://www.cnblogs.com/chrischennx/p/6623610.html
For modifying the database, see:
https://blog.csdn.net/hzw19920329/article/details/55670782

Origin blog.csdn.net/jackgo73/article/details/89957646