Test data:
    CREATE TABLE `test` (`name` varchar(30));
    insert into test values('abc');
    insert into test values('Aaa');
    insert into test values('ccc');
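Starting with a pitfall
Scenario: query the rows in table test whose value starts with an uppercase A. The obvious query is:
    19:54:18[5.7.25-log]root->192.168.30.20[mtest]> select * from test where name like 'A%';
    +------+
    | name |
    +------+
    | abc  |
    | Aaa  |
    +------+
    2 rows in set (0.00 sec)
The result is not what we wanted: abc is returned as well.
This shows that, in a default database environment, MySQL filters strings without regard to case.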
The right way to match case
Method 1: rewrite the SQL and put binary in front of the pattern string
    select * from test where name like binary 'A%';
    +------+
    | name |
    +------+
    | Aaa  |
    +------+
    1 row in set (0.00 sec)
Method 2: declare the column as binary
    alter table test modify name varchar(30) binary;
    20:04:18[5.7.25-log]root->192.168.30.20[mtest]> select * from test where name like 'A%';
    +------+
    | name |
    +------+
    | Aaa  |
    +------+
    1 row in set (0.00 sec)
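Method 3: specify a case-sensitive collation when creating the table. With a binary collation on the table, the plain like already distinguishes case, so binary is no longer needed in the query.
    CREATE TABLE `test` (`name` varchar(30)) default charset=utf8mb4 collate=utf8mb4_bin;
    insert into test values('abc');
    insert into test values('Aaa');
    insert into test values('ccc');
    select * from test where name like 'A%';
    +------+
    | name |
    +------+
    | Aaa  |
    +------+
    1 row in set (0.00 sec)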
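The behaviour seen in the queries above can be approximated in a few lines of Python. This is only a rough model, not how MySQL implements collations: the helper name `like_prefix` is hypothetical, and `casefold()` is a simplified stand-in for a `_ci` collation (real `_ci` collations have their own weight tables and also affect accents).

```python
def like_prefix(value: str, prefix: str, collation: str) -> bool:
    """Rough model of `value LIKE 'prefix%'` under the given collation."""
    if collation.endswith("_bin"):
        # Binary collation: characters are compared by code point, so 'a' != 'A'.
        return value.startswith(prefix)
    # Assumption: approximate a case-insensitive collation by folding case first.
    return value.casefold().startswith(prefix.casefold())


rows = ["abc", "Aaa", "ccc"]
ci_hits = [r for r in rows if like_prefix(r, "A", "utf8mb4_general_ci")]
bin_hits = [r for r in rows if like_prefix(r, "A", "utf8mb4_bin")]

print(ci_hits)   # ['abc', 'Aaa']
print(bin_hits)  # ['Aaa']
```

The first list matches the unexpected result from the pitfall, and the second matches what all three methods return.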
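The collation determines whether string comparison is case-sensitive
The default collations of commonly used character sets in MySQL are as follows: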
    show character set;
(or query information_schema.CHARACTER_SETS)
    +----------+---------------------------------+---------------------+--------+
    | Charset  | Description                     | Default collation   | Maxlen |
    +----------+---------------------------------+---------------------+--------+
    | gb2312   | GB2312 Simplified Chinese       | gb2312_chinese_ci   |      2 |
    | gbk      | GBK Simplified Chinese          | gbk_chinese_ci      |      2 |
    | utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
    | utf8mb4  | UTF-8 Unicode                   | utf8mb4_general_ci  |      4 |
    | gb18030  | China National Standard GB18030 | gb18030_chinese_ci  |      4 |
    +----------+---------------------------------+---------------------+--------+
In this 5.7 environment, the default collation of the commonly used utf8mb4 character set is utf8mb4_general_ci. Next, list the collations that utf8mb4 supports:
    show collation like 'utf8mb4%';
(or query information_schema.COLLATIONS)
    +------------------------+---------+-----+---------+----------+---------+
    | Collation              | Charset | Id  | Default | Compiled | Sortlen |
    +------------------------+---------+-----+---------+----------+---------+
    | utf8mb4_general_ci     | utf8mb4 |  45 | Yes     | Yes      |       1 |
    | utf8mb4_bin            | utf8mb4 |  46 |         | Yes      |       1 |
    | utf8mb4_unicode_ci     | utf8mb4 | 224 |         | Yes      |       8 |
    | utf8mb4_icelandic_ci   | utf8mb4 | 225 |         | Yes      |       8 |
    | utf8mb4_latvian_ci     | utf8mb4 | 226 |         | Yes      |       8 |
    | utf8mb4_romanian_ci    | utf8mb4 | 227 |         | Yes      |       8 |
    | utf8mb4_slovenian_ci   | utf8mb4 | 228 |         | Yes      |       8 |
    | utf8mb4_polish_ci      | utf8mb4 | 229 |         | Yes      |       8 |
    | utf8mb4_estonian_ci    | utf8mb4 | 230 |         | Yes      |       8 |
    | utf8mb4_spanish_ci     | utf8mb4 | 231 |         | Yes      |       8 |
    | utf8mb4_swedish_ci     | utf8mb4 | 232 |         | Yes      |       8 |
    | utf8mb4_turkish_ci     | utf8mb4 | 233 |         | Yes      |       8 |
    | utf8mb4_czech_ci       | utf8mb4 | 234 |         | Yes      |       8 |
    | utf8mb4_danish_ci      | utf8mb4 | 235 |         | Yes      |       8 |
    | utf8mb4_lithuanian_ci  | utf8mb4 | 236 |         | Yes      |       8 |
    | utf8mb4_slovak_ci      | utf8mb4 | 237 |         | Yes      |       8 |
    | utf8mb4_spanish2_ci    | utf8mb4 | 238 |         | Yes      |       8 |
    | utf8mb4_roman_ci       | utf8mb4 | 239 |         | Yes      |       8 |
    | utf8mb4_persian_ci     | utf8mb4 | 240 |         | Yes      |       8 |
    | utf8mb4_esperanto_ci   | utf8mb4 | 241 |         | Yes      |       8 |
    | utf8mb4_hungarian_ci   | utf8mb4 | 242 |         | Yes      |       8 |
    | utf8mb4_sinhala_ci     | utf8mb4 | 243 |         | Yes      |       8 |
    | utf8mb4_german2_ci     | utf8mb4 | 244 |         | Yes      |       8 |
    | utf8mb4_croatian_ci    | utf8mb4 | 245 |         | Yes      |       8 |
    | utf8mb4_unicode_520_ci | utf8mb4 | 246 |         | Yes      |       8 |
    | utf8mb4_vietnamese_ci  | utf8mb4 | 247 |         | Yes      |       8 |
    +------------------------+---------+-----+---------+----------+---------+
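Collation naming rules:
Each character set may have several collations. Their names follow this pattern:
- The collation name starts with the name of its character set
- The middle part indicates the language the collation mainly targets
- The suffix is one of the following:
Suffix | Full name | Meaning |
---|---|---|
`_ai` | accent insensitive | Accents are ignored |
`_as` | accent sensitive | Accents are distinguished |
`_ci` | case insensitive | Case is ignored |
`_cs` | case sensitive | Case is distinguished |
`_bin` | binary | Compared as binary values |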
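The naming pattern above can be turned into a small parser. The sketch below is illustrative only: `parse_collation` and the dictionary keys are hypothetical names, and it assumes the character set name itself contains no underscore, which holds for the names listed in this article.

```python
SUFFIXES = {
    "ai": "accent insensitive",
    "as": "accent sensitive",
    "ci": "case insensitive",
    "cs": "case sensitive",
    "bin": "binary",
}


def parse_collation(name: str) -> dict:
    """Split a collation name into charset, middle part, and suffix flags."""
    parts = name.split("_")
    charset, rest = parts[0], parts[1:]
    flags = []
    # Peel known suffixes off the end; whatever remains is the middle part.
    while rest and rest[-1] in SUFFIXES:
        flags.insert(0, rest.pop())
    return {
        "charset": charset,
        "middle": "_".join(rest) or None,
        "flags": flags,
        # Anything that is not explicitly _ci is treated as case-sensitive here.
        "case_sensitive": "ci" not in flags,
    }


for name in ("utf8mb4_general_ci", "utf8mb4_bin", "utf8mb4_unicode_520_ci"):
    print(name, parse_collation(name))
```

For `utf8mb4_general_ci` this reports a case-insensitive collation, and for `utf8mb4_bin` a case-sensitive one with no middle part.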
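Back to the original pitfall: the database character set is utf8mb4, and a table created with the defaults gets the collation utf8mb4_general_ci, which is case-insensitive. That is why the filter matched strings regardless of case.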
How character sets and collations are applied to databases, tables, and columns
They are affected by the following parameters:
    show variables like '%character%';
    +--------------------------+----------------------------+
    | Variable_name            | Value                      |
    +--------------------------+----------------------------+
    | character_set_client     | utf8                       |
    | character_set_connection | utf8                       |
    | character_set_database   | utf8                       |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8                       |
    | character_set_server     | utf8mb4                    |
    | character_set_system     | utf8                       |
    | character_sets_dir       | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+
    show variables like '%collation%';
    +----------------------+--------------------+
    | Variable_name        | Value              |
    +----------------------+--------------------+
    | collation_connection | utf8_general_ci    |
    | collation_database   | utf8_general_ci    |
    | collation_server     | utf8mb4_general_ci |
    +----------------------+--------------------+
The system parameters related to character sets and collations are not covered in detail here; see the official documentation.
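Rules for the database character set and collation:
- If character set or collate is specified when the database is created, the specified character set and collation are used.
- If neither a character set nor a collation is specified when the database is created, character_set_server and collation_server are used.
- If only the character set is specified when the database is created, the default collation of that character set is used.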
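Rules for the table character set and collation:
Suppose the values of character set and collate given when creating the table are charset_name and collation_name. When the table is created:
- If both charset_name and collation_name are given, the given character set and collation are used.
- If only charset_name is given, the character set is charset_name and the collation is the default collation of charset_name.
- If only collation_name is given, the collation is collation_name and the character set is the one associated with collation_name.
- If neither charset_name nor collation_name is given, the character set and collation of the database are used.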
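Character set and collation of a column:
Columns follow the same rules as tables described above, except that a column with no character set or collation specified inherits the settings of its table.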
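The resolution rules for databases, tables, and columns share one shape, which the following Python sketch models. It is an illustration under stated assumptions, not MySQL's implementation: `resolve` and `DEFAULT_COLLATION` are hypothetical names, the default collations are copied from the show character set output in this article, and the sketch does not reject mismatched character set and collation pairs the way a real server would.

```python
from typing import Optional, Tuple

# Default collations as listed by SHOW CHARACTER SET in this article (MySQL 5.7).
DEFAULT_COLLATION = {
    "utf8": "utf8_general_ci",
    "utf8mb4": "utf8mb4_general_ci",
    "gbk": "gbk_chinese_ci",
}

Setting = Tuple[str, str]  # (charset, collation)


def resolve(parent: Setting,
            charset: Optional[str] = None,
            collation: Optional[str] = None) -> Setting:
    """Apply the four cases: both given, charset only, collation only, neither."""
    if charset and collation:
        return charset, collation
    if charset:
        return charset, DEFAULT_COLLATION[charset]
    if collation:
        # A collation name starts with the name of its character set.
        return collation.split("_")[0], collation
    return parent  # nothing specified: inherit from the enclosing level


server = ("utf8mb4", "utf8mb4_general_ci")         # character_set_server / collation_server
database = resolve(server)                         # nothing specified
table = resolve(database, charset="utf8")          # charset only
column = resolve(table, collation="utf8mb4_bin")   # collation only

print(database)  # ('utf8mb4', 'utf8mb4_general_ci')
print(table)     # ('utf8', 'utf8_general_ci')
print(column)    # ('utf8mb4', 'utf8mb4_bin')
```

Each level passes its resolved setting down as the parent of the next, which mirrors the inheritance from server to database to table to column.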
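Oracle and TiDB compare strings case-sensitively by default
In Oracle, the default sort rule for languages such as English, Chinese, and Japanese is binary, so comparisons are case-sensitive.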
- Query all character sets supported by the database:
    select * from V$NLS_VALID_VALUES where PARAMETER = 'CHARACTERSET'
The commonly used character sets are listed below:
Characterset | Supported Languages (+ English) |
---|---|
WE8ISO8859P15 (ISO 8859-15), WE8MSWIN1252 | Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, Finnish, French, Frisian, Galician, German, Greenlandic, Irish Gaelic (new orthography), Italian, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish. The Euro symbol |
NEE8ISO8859P4 (ISO 8859-4), BLT8MSWIN1257 | Danish, Estonian, Finnish, German, Greenlandic, Latvian, Lithuanian, Norwegian, Sami, Slovenian, Swedish. (1257 also supports the Euro symbol) |
CL8ISO8859P5 (ISO 8859-5), CL8MSWIN1251 | Bulgarian, Belarussian (previously known as Byelorussian), Slavic Macedonian, Russian, Serbian, Ukrainian. (1251 also supports the Euro symbol) |
JA16SJIS | Japanese |
ZHS16GBK | Simplified Chinese |
ZHT16MSWIN950 | Traditional Chinese (Taiwan) |
ZHT16HKSCS and ZHT16HKSCS31 | Traditional Chinese (Hong Kong) |
UTF8, AL32UTF8, AL16UTF16 | All above languages and many more. See Note 1051824.6, "What languages are supported in an Unicode (UTF8/AL32UTF8) database?" |
- Default sort rules of the languages Oracle supports
Reference: Database Globalization Support Guide -> A Locale Data -> Languages
A few commonly used ones are listed below:
Language name | Abbreviation | Default sort |
---|---|---|
AMERICAN | us | binary |
FRENCH | f | FRENCH |
GERMAN | d | GERMAN |
SIMPLIFIED CHINESE | zhs | binary |
TRADITIONAL CHINESE | zht | binary |
JAPANESE | ja | binary |
KOREAN | ko | binary |
... | ... | ... |
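As the table shows, the default sort for most Asian languages, including Chinese, is binary. To sort Chinese characters by pinyin or by radical, the nls_sort parameter has to be set.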
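To see what a binary default sort means in practice, the Python sketch below compares byte-wise ordering with a case-folded ordering. The case-folded key is only a stand-in for a non-binary sort; it is an assumption made for illustration and does not reproduce any particular nls_sort value.

```python
words = ["abc", "Aaa", "ccc", "Bbb"]

# Binary sort: compare the encoded bytes, so every uppercase ASCII letter
# sorts before every lowercase one.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Stand-in for a non-binary sort: ignore case when ordering.
folded_order = sorted(words, key=str.casefold)

print(binary_order)  # ['Aaa', 'Bbb', 'abc', 'ccc']
print(folded_order)  # ['Aaa', 'abc', 'Bbb', 'ccc']
```

Under the binary ordering, 'Bbb' comes before 'abc', which is the same property that makes binary comparisons case-sensitive.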
In TiDB 3.0, the default collation of every character set is binary, so string comparison in TiDB is strictly case-sensitive:
    show collation;
    +-------------+---------+------+---------+----------+---------+
    | Collation   | Charset | Id   | Default | Compiled | Sortlen |
    +-------------+---------+------+---------+----------+---------+
    | utf8mb4_bin | utf8mb4 |   46 | Yes     | Yes      |       1 |
    | latin1_bin  | latin1  |   47 | Yes     | Yes      |       1 |
    | binary      | binary  |   63 | Yes     | Yes      |       1 |
    | ascii_bin   | ascii   |   65 | Yes     | Yes      |       1 |
    | utf8_bin    | utf8    |   83 | Yes     | Yes      |       1 |
    +-------------+---------+------+---------+----------+---------+
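Case rules for table names, column names, and SQL keywords
In Oracle, unquoted table and column names are stored in uppercase in the data dictionary, and SQL statements can refer to them in any case.
In MySQL/TiDB, how table names are stored and compared is controlled by the parameter lower_case_table_names. With the value 1, table names are stored in lowercase and SQL statements can refer to them in any case. With the value 0, which is the MySQL default on Linux, table names are stored as written and are case-sensitive in SQL. Column names are stored in the data dictionary with the case used at creation time, and SQL statements can refer to them in any case.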
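Rules for using reserved keywords as names:
Oracle: wrap the name in double quotes, "column_name". The quoted name then becomes case-sensitive, so a column that was created without quotes must be written in uppercase inside the double quotes.
MySQL/TiDB: wrap the name in backticks, `column_name`. The name is not case-sensitive.
Note: in MySQL/TiDB SQL, text in double quotes is a string literal under the default sql_mode, and is not parsed as a column name the way Oracle does. In Oracle, string literals use single quotes.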
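The quoting rules above can be modelled with two small Python helpers. Both function names are hypothetical and the model is deliberately simplified: it covers only how a column name is folded before lookup, not full SQL parsing.

```python
def oracle_identifier(token: str) -> str:
    """Name Oracle looks up for an identifier token (simplified model)."""
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return token[1:-1]   # quoted: kept exactly as written
    return token.upper()     # unquoted: folded to uppercase


def mysql_column(token: str) -> str:
    """Column name as compared by MySQL/TiDB (simplified model)."""
    # Backticks only delimit the name; column names compare case-insensitively.
    return token.strip("`").lower()


# A column created as `name` without quotes is stored as NAME in Oracle.
stored = oracle_identifier("name")
print(stored == oracle_identifier('"name"'))   # False: quoted lowercase does not match
print(stored == oracle_identifier('"NAME"'))   # True: quoted uppercase matches
print(mysql_column("`Name`") == mysql_column("name"))  # True
```

This is why a reserved word used as a column name in Oracle has to be written in uppercase inside the double quotes, while the backtick form in MySQL/TiDB works in any case.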