Case-handling rules for table names, column names, and strings in databases

Test data:

CREATE TABLE `test` (`name` varchar(30));
insert into test values('abc');
insert into test values('Aaa');
insert into test values('ccc');

Starting with a pitfall

Scenario: query the rows in table test whose name starts with an uppercase A:

19:54:18[5.7.25-log]root->192.168.30.20[mtest]> select * from test where name like 'A%';
+------+
| name |
+------+
| abc  |
| Aaa  |
+------+
2 rows in set (0.00 sec)

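The result is not what we wanted: abc is returned as well.

This shows that in the default environment MySQL filters strings case-insensitively.

The right way to match case

Method 1: rewrite the SQL and put the BINARY operator in front of the string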

select * from test where name like binary'A%';
+------+
| name |
+------+
| Aaa  |
+------+
1 row in set (0.00 sec)

Method 2: declare the column with the BINARY attribute

alter table test modify name varchar(30) binary;

20:04:18[5.7.25-log]root->192.168.30.20[mtest]> select * from test where name like 'A%';
+------+
| name |
+------+
| Aaa  |
+------+
1 row in set (0.00 sec)

Method 3: specify a case-sensitive collation when creating the table

CREATE TABLE `test` (`name` varchar(30)) default charset=utf8mb4 collate=utf8mb4_bin;
insert into test values('abc');
insert into test values('Aaa');
insert into test values('ccc');

select * from test where name like 'A%';
+------+
| name |
+------+
| Aaa  |
+------+
1 row in set (0.00 sec)
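
The difference between the default comparison and the binary comparison used by the three methods can be modelled outside the database. The following is a minimal Python sketch, not MySQL's actual implementation. It assumes that a case-insensitive collation behaves like comparing case-folded strings and that a binary collation compares the raw values, which holds for the plain ASCII test data used here. The helper name like_prefix is hypothetical.

```python
rows = ["abc", "Aaa", "ccc"]  # same values as the test table


def like_prefix(value, prefix, case_sensitive):
    """Approximate `value LIKE 'prefix%'` for ASCII data (hypothetical helper)."""
    if not case_sensitive:
        # Assumption: a _ci collation treats 'A' and 'a' as equal.
        value, prefix = value.casefold(), prefix.casefold()
    return value.startswith(prefix)


# Default behaviour in the article's environment: case is ignored.
ci_result = [r for r in rows if like_prefix(r, "A", case_sensitive=False)]
# Behaviour with BINARY or a _bin collation: case matters.
bin_result = [r for r in rows if like_prefix(r, "A", case_sensitive=True)]

print(ci_result)   # ['abc', 'Aaa']
print(bin_result)  # ['Aaa']
```

The two printed lists correspond to the two result sets shown above: two rows without binary comparison, one row with it.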

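The collation determines whether string comparison is case-sensitive

Default collations of the commonly used character sets in MySQL:

show character set;  (or query information_schema.CHARACTER_SETS)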
+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| gb2312   | GB2312 Simplified Chinese       | gb2312_chinese_ci   |      2 |
| gbk      | GBK Simplified Chinese          | gbk_chinese_ci      |      2 |
| utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
| utf8mb4  | UTF-8 Unicode                   | utf8mb4_general_ci  |      4 |
| gb18030  | China National Standard GB18030 | gb18030_chinese_ci  |      4 |
+----------+---------------------------------+---------------------+--------+

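As shown, the default collation of the commonly used utf8mb4 character set in this 5.7 environment is utf8mb4_general_ci.
Next, list the collations that the utf8mb4 character set supports:
show collation like 'utf8mb4%';  (or query information_schema.COLLATIONS)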
+------------------------+---------+-----+---------+----------+---------+
| Collation              | Charset | Id  | Default | Compiled | Sortlen |
+------------------------+---------+-----+---------+----------+---------+
| utf8mb4_general_ci     | utf8mb4 |  45 | Yes     | Yes      |       1 |
| utf8mb4_bin            | utf8mb4 |  46 |         | Yes      |       1 |
| utf8mb4_unicode_ci     | utf8mb4 | 224 |         | Yes      |       8 |
| utf8mb4_icelandic_ci   | utf8mb4 | 225 |         | Yes      |       8 |
| utf8mb4_latvian_ci     | utf8mb4 | 226 |         | Yes      |       8 |
| utf8mb4_romanian_ci    | utf8mb4 | 227 |         | Yes      |       8 |
| utf8mb4_slovenian_ci   | utf8mb4 | 228 |         | Yes      |       8 |
| utf8mb4_polish_ci      | utf8mb4 | 229 |         | Yes      |       8 |
| utf8mb4_estonian_ci    | utf8mb4 | 230 |         | Yes      |       8 |
| utf8mb4_spanish_ci     | utf8mb4 | 231 |         | Yes      |       8 |
| utf8mb4_swedish_ci     | utf8mb4 | 232 |         | Yes      |       8 |
| utf8mb4_turkish_ci     | utf8mb4 | 233 |         | Yes      |       8 |
| utf8mb4_czech_ci       | utf8mb4 | 234 |         | Yes      |       8 |
| utf8mb4_danish_ci      | utf8mb4 | 235 |         | Yes      |       8 |
| utf8mb4_lithuanian_ci  | utf8mb4 | 236 |         | Yes      |       8 |
| utf8mb4_slovak_ci      | utf8mb4 | 237 |         | Yes      |       8 |
| utf8mb4_spanish2_ci    | utf8mb4 | 238 |         | Yes      |       8 |
| utf8mb4_roman_ci       | utf8mb4 | 239 |         | Yes      |       8 |
| utf8mb4_persian_ci     | utf8mb4 | 240 |         | Yes      |       8 |
| utf8mb4_esperanto_ci   | utf8mb4 | 241 |         | Yes      |       8 |
| utf8mb4_hungarian_ci   | utf8mb4 | 242 |         | Yes      |       8 |
| utf8mb4_sinhala_ci     | utf8mb4 | 243 |         | Yes      |       8 |
| utf8mb4_german2_ci     | utf8mb4 | 244 |         | Yes      |       8 |
| utf8mb4_croatian_ci    | utf8mb4 | 245 |         | Yes      |       8 |
| utf8mb4_unicode_520_ci | utf8mb4 | 246 |         | Yes      |       8 |
| utf8mb4_vietnamese_ci  | utf8mb4 | 247 |         | Yes      |       8 |
+------------------------+---------+-----+---------+----------+---------+

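Collation naming rules:

Each character set may have several collations. Their names follow this pattern:

  • The collation name starts with the name of its character set
  • The middle part indicates the language the collation is mainly intended for
  • The suffix is one of the following:

+--------+--------------------+------------------------------+
| Suffix | Full name          | Meaning                      |
+--------+--------------------+------------------------------+
| `_ai`  | accent insensitive | Does not distinguish accents |
| `_as`  | accent sensitive   | Distinguishes accents        |
| `_ci`  | case insensitive   | Does not distinguish case    |
| `_cs`  | case sensitive     | Distinguishes case           |
| `_bin` | binary             | Compares as binary values    |
+--------+--------------------+------------------------------+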
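
The naming pattern can be checked mechanically. Below is a small Python sketch that splits a collation name into the three parts described above. It is an illustration only: the function name describe_collation is hypothetical, and it assumes that the character set name itself contains no underscore, which is true for the collations listed in this article.

```python
SUFFIXES = {
    "ai": "accent insensitive",
    "as": "accent sensitive",
    "ci": "case insensitive",
    "cs": "case sensitive",
    "bin": "binary",
}


def describe_collation(name):
    """Split a collation name into charset, language part and suffixes."""
    parts = name.split("_")
    charset, rest = parts[0], parts[1:]
    suffixes = []
    # Peel known suffixes off the end; a name may carry more than one.
    while rest and rest[-1] in SUFFIXES:
        suffixes.insert(0, rest.pop())
    return {
        "charset": charset,
        "language": "_".join(rest),  # empty for names such as utf8mb4_bin
        "suffixes": [SUFFIXES[s] for s in suffixes],
    }


for name in ("utf8mb4_general_ci", "utf8mb4_bin", "utf8mb4_unicode_520_ci"):
    print(name, describe_collation(name))
```

For utf8mb4_general_ci this yields the character set utf8mb4, the middle part general, and the suffix meaning "case insensitive".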

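To sum up the pitfall from the beginning: the database character set is utf8mb4 and the table was created with the defaults, so its collation is utf8mb4_general_ci, which is case-insensitive. That is why the filter ignored case.

How character sets and collations apply to databases, tables, and columns

They are affected by the following parameters: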

show variables like '%character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

show variables like '%collation%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8_general_ci    |
| collation_database   | utf8_general_ci    |
| collation_server     | utf8mb4_general_ci |
+----------------------+--------------------+

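The system variables related to character sets and collations are not covered in detail here; see the official documentation.

Rules for the database character set and collation:

  • If character set or collate is specified when the database is created, the specified character set and collation are used.
  • If neither a character set nor a collation is specified when the database is created, character_set_server and collation_server are used.
  • If only the character set is specified when the database is created, the default collation of that character set is used.

Rules for the table character set and collation:

Assume the values of character set and collate at table creation are charset_name and collation_name. When the table is created:

  • If both charset_name and collation_name are given, the given character set and collation are used.
  • If only charset_name is given, the character set is charset_name and the collation is the default collation of charset_name.
  • If only collation_name is given, the collation is collation_name and the character set is the one associated with collation_name.
  • If neither charset_name nor collation_name is given, the character set and collation of the database are used.

Character set and collation of a column:

Columns follow the same rules as tables, except that a column with no character set or collation specified inherits the settings of its table.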
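
The rules for databases, tables, and columns share one shape, which can be sketched as a single function applied level by level. The Python below is a simplified model under stated assumptions, not the server's logic: the function name resolve is hypothetical, the default collations are copied from the show character set output earlier in this article, and the sketch does not check that an explicitly given character set and collation actually belong together.

```python
# Default collation per character set, copied from the SHOW CHARACTER SET output.
DEFAULT_COLLATION = {
    "gb2312": "gb2312_chinese_ci",
    "gbk": "gbk_chinese_ci",
    "utf8": "utf8_general_ci",
    "utf8mb4": "utf8mb4_general_ci",
    "gb18030": "gb18030_chinese_ci",
}


def resolve(parent, charset=None, collation=None):
    """Return the effective (charset, collation) pair for one level.

    parent is the pair of the enclosing level:
    server -> database -> table -> column.
    """
    if charset and collation:
        return charset, collation
    if charset:
        # Only the character set is given: use its default collation.
        return charset, DEFAULT_COLLATION[charset]
    if collation:
        # Only the collation is given: its name starts with its character set.
        return collation.split("_")[0], collation
    # Nothing given: inherit from the enclosing level.
    return parent


server = ("utf8mb4", "utf8mb4_general_ci")        # character_set_server / collation_server
database = resolve(server)                         # no options at creation
table = resolve(database, charset="utf8")          # only the character set given
column = resolve(table, collation="utf8mb4_bin")   # only the collation given

print(database)  # ('utf8mb4', 'utf8mb4_general_ci')
print(table)     # ('utf8', 'utf8_general_ci')
print(column)    # ('utf8mb4', 'utf8mb4_bin')
```

Each level only looks at its own options and at the level directly above it, which is why a column can end up with a different collation from its table or database.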

In Oracle and TiDB, string comparison is case-sensitive by default

In Oracle, the default sort rule for languages such as English, Chinese, and Japanese is binary, so comparisons are case-sensitive.

  • Query all character sets the database supports:

select * from V$NLS_VALID_VALUES where PARAMETER = 'CHARACTERSET';

Commonly used character sets:

Characterset | Supported Languages (+ English)
WE8ISO8859P15 (ISO 8859-15), WE8MSWIN1252 | Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, Finnish, French, Frisian, Galician, German, Greenlandic, Irish Gaelic (new orthography), Italian, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish. The Euro symbol
NEE8ISO8859P4 (ISO 8859-4), BLT8MSWIN1257 | Danish, Estonian, Finnish, German, Greenlandic, Latvian, Lithuanian, Norwegian, Sami, Slovenian, Swedish. (1257 also supports the Euro symbol)
CL8ISO8859P5 (ISO 8859-5), CL8MSWIN1251 | Bulgarian, Belarussian (previously known as Byelorussian), Slavic Macedonian, Russian, Serbian, Ukrainian. (1251 also supports the Euro symbol)
JA16SJIS | Japanese
ZHS16GBK | Simplified Chinese
ZHT16MSWIN950 | Traditional Chinese (Taiwan)
ZHT16HKSCS and ZHT16HKSCS31 | Traditional Chinese (Hong Kong)
UTF8, AL32UTF8, AL16UTF16 | All above languages and many more. Note 1051824.6 What languages are supported in an Unicode (UTF8/AL32UTF8) database?

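  • Default sort rules for the languages Oracle supports

Reference: Database Globalization Support Guide -> A Locale Data -> Languages

A few common ones:

+---------------------+--------------+--------------+
| Language name       | Abbreviation | Default sort |
+---------------------+--------------+--------------+
| AMERICAN            | us           | binary       |
| FRENCH              | f            | FRENCH       |
| GERMAN              | d            | GERMAN       |
| SIMPLIFIED CHINESE  | zhs          | binary       |
| TRADITIONAL CHINESE | zht          | binary       |
| JAPANESE            | ja           | binary       |
| KOREAN              | ko           | binary       |
| ...                 | ...          | ...          |
+---------------------+--------------+--------------+

As shown, most Asian languages, Chinese included, default to a binary sort. To sort Chinese characters by Pinyin or by radical, set the nls_sort parameter.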

In TiDB 3.0, the default collation of every character set is a binary one, so string comparison in TiDB is strictly case-sensitive:

show collation;
+-------------+---------+------+---------+----------+---------+
| Collation   | Charset | Id   | Default | Compiled | Sortlen |
+-------------+---------+------+---------+----------+---------+
| utf8mb4_bin | utf8mb4 |   46 | Yes     | Yes      |       1 |
| latin1_bin  | latin1  |   47 | Yes     | Yes      |       1 |
| binary      | binary  |   63 | Yes     | Yes      |       1 |
| ascii_bin   | ascii   |   65 | Yes     | Yes      |       1 |
| utf8_bin    | utf8    |   83 | Yes     | Yes      |       1 |
+-------------+---------+------+---------+----------+---------+

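Case rules for table names, column names, and SQL keywords

In Oracle, unquoted table and column names are stored in uppercase in the data dictionary, and SQL statements may refer to them in any case.

In MySQL/TiDB, how table names are stored and compared is controlled by the lower_case_table_names parameter: with the value 1 they are stored in lowercase and matched case-insensitively, while with 0 (MySQL's default on Linux) they are stored as written and matched case-sensitively. Column names are stored in the data dictionary with the case used at creation, and SQL statements may refer to them in any case.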
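
The effect of lower_case_table_names can be sketched in Python. This is a simplified model of the three values MySQL documents for the parameter (0, 1, and 2), not server code, and the function names are hypothetical.

```python
def stored_table_name(name, lower_case_table_names):
    """Spelling under which the table name is stored."""
    # Only the value 1 converts the name to lowercase for storage.
    return name.lower() if lower_case_table_names == 1 else name


def same_table(a, b, lower_case_table_names):
    """Whether two spellings refer to the same table."""
    if lower_case_table_names == 0:
        return a == b              # compared exactly as written
    return a.lower() == b.lower()  # values 1 and 2 compare in lowercase


for setting in (0, 1, 2):
    print(setting, stored_table_name("Test", setting), same_table("Test", "test", setting))
```

With the value 0, Test and test are two different tables; with 1 or 2 they are the same table, and only the value 1 changes the stored spelling.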

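Rules for using reserved keywords as names:

Oracle: wrap the name in double quotes, "column_name". A quoted name becomes case-sensitive, so a column that was created without quotes must be written in uppercase inside the quotes.

MySQL/TiDB: wrap the name in backticks, `column_name`. The name remains case-insensitive.

Note: in MySQL/TiDB SQL, text in double quotes is by default a string literal and is not parsed as a column name the way it is in Oracle. In Oracle, string literals are written in single quotes.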
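
Oracle's treatment of quoted and unquoted names can be illustrated with a short Python sketch. It is a model of the rule only, assuming that unquoted names are folded to uppercase before the dictionary lookup and that double-quoted names are taken exactly as written; the function name oracle_identifier is hypothetical.

```python
def oracle_identifier(token):
    """Name that is looked up in the data dictionary for an identifier token."""
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return token[1:-1]  # quoted: taken exactly as written
    return token.upper()    # unquoted: folded to uppercase


stored = oracle_identifier("name")  # column created without quotes

print(stored)                                 # NAME
print(oracle_identifier("Name") == stored)    # True
print(oracle_identifier('"NAME"') == stored)  # True
print(oracle_identifier('"name"') == stored)  # False
```

The last line shows why the quoted form has to be written in uppercase to reach a column that was created without quotes.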

Reposted from blog.csdn.net/u010033674/article/details/113175055