*: add gbk info
zimulala committed Dec 31, 2021
1 parent 1f1c05c commit 01f4725
Showing 4 changed files with 125 additions and 13 deletions.
4 changes: 3 additions & 1 deletion TOC.md
@@ -538,7 +538,9 @@
  - [Views](/views.md)
  - [Partitioned Tables](/partitioned-table.md)
  - [Temporary Tables](/temporary-tables.md)
- - [Character Set and Collation](/character-set-and-collation.md)
+ - Character Set and Collation
+   - [Character Set and Collation](/character-set-and-collation.md)
+   - [GBK Character Set](/character-set-gbk.md)
  - [Placement Rules in SQL](/placement-rules-in-sql.md)
  - System Tables
    - [`mysql`](/mysql-schema.md)
24 changes: 13 additions & 11 deletions character-set-and-collation.md
@@ -56,15 +56,16 @@ SHOW CHARACTER SET;
```

```sql
- +---------|---------------|-------------------|--------+
- | Charset | Description   | Default collation | Maxlen |
- +---------|---------------|-------------------|--------+
- | utf8    | UTF-8 Unicode | utf8_bin          |      3 |
- | utf8mb4 | UTF-8 Unicode | utf8mb4_bin       |      4 |
- | ascii   | US ASCII      | ascii_bin         |      1 |
- | latin1  | Latin1        | latin1_bin        |      1 |
- | binary  | binary        | binary            |      1 |
- +---------|---------------|-------------------|--------+
+ +---------+-------------------------------------+-------------------+--------+
+ | Charset | Description                         | Default collation | Maxlen |
+ +---------+-------------------------------------+-------------------+--------+
+ | ascii   | US ASCII                            | ascii_bin         |      1 |
+ | binary  | binary                              | binary            |      1 |
+ | gbk     | Chinese Internal Code Specification | gbk_bin           |      2 |
+ | latin1  | Latin1                              | latin1_bin        |      1 |
+ | utf8    | UTF-8 Unicode                       | utf8_bin          |      3 |
+ | utf8mb4 | UTF-8 Unicode                       | utf8mb4_bin       |      4 |
+ +---------+-------------------------------------+-------------------+--------+
5 rows in set (0.00 sec)
```

@@ -80,6 +81,7 @@ mysql> show collation;
  | binary      | binary  |   63 | Yes     | Yes      |       1 |
  | ascii_bin   | ascii   |   65 | Yes     | Yes      |       1 |
  | utf8_bin    | utf8    |   83 | Yes     | Yes      |       1 |
+ | gbk_bin     | gbk     |   87 | Yes     | Yes      |       1 |
  +-------------+---------+------+---------+----------+---------+
  5 rows in set (0.01 sec)
```
@@ -418,9 +420,9 @@ SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled
1 row in set (0.00 sec)
```

- Under the new collation framework, TiDB supports the `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, and `utf8mb4_unicode_ci` collations, which are compatible with MySQL.
+ Under the new collation framework, TiDB supports the `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `gbk_chinese_ci`, and `gbk_bin` collations, which are compatible with MySQL.

- When any of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, or `utf8mb4_unicode_ci` is used, string comparisons are case-insensitive and accent-insensitive. In addition, TiDB corrects the `PADDING` behavior of collations:
+ When any of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, or `gbk_chinese_ci` is used, string comparisons are case-insensitive and accent-insensitive. In addition, TiDB corrects the `PADDING` behavior of collations:

{{< copyable "sql" >}}

108 changes: 108 additions & 0 deletions character-set-gbk.md
@@ -0,0 +1,108 @@
---
title: GBK Character Set
---

# GBK Character Set

This document describes TiDB's support for the GBK character set.

TiDB currently supports the GBK character set with the following collations:
```sql
SHOW CHARACTER SET WHERE CHARSET = 'gbk';
+---------+-------------------------------------+-------------------+--------+
| Charset | Description                         | Default collation | Maxlen |
+---------+-------------------------------------+-------------------+--------+
| gbk     | Chinese Internal Code Specification | gbk_bin           |      2 |
+---------+-------------------------------------+-------------------+--------+
1 row in set (0.00 sec)


SHOW COLLATION WHERE CHARSET = 'gbk';
+----------------+---------+------+---------+----------+---------+
| Collation      | Charset | Id   | Default | Compiled | Sortlen |
+----------------+---------+------+---------+----------+---------+
| gbk_bin        | gbk     |   87 |         | Yes      |       1 |
+----------------+---------+------+---------+----------+---------+
1 row in set (0.00 sec)
```
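The default collation of the GBK character set in TiDB differs from that of MySQL, whose default is `gbk_chinese_ci`. In addition, the `gbk_bin` collation that TiDB currently supports also differs from the one MySQL uses: TiDB converts GBK to UTF8MB4 and then performs a binary sort.
To be compatible with MySQL's collations, you need to use the [new collation framework](/character-set-and-collation.md#新框架下的排序规则支持).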
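
To illustrate why a binary sort performed after conversion to UTF8MB4 can order strings differently from a binary sort on the original GBK bytes, here is a minimal Python sketch. It is an illustration only, not TiDB's implementation: it assumes Python's built-in `gbk` and `utf-8` codecs as stand-ins, and the two sample characters are chosen just for this demonstration.

```python
# Illustration only: compare byte-wise ordering of the same strings
# under GBK encoding and under UTF-8 (utf8mb4) encoding.
words = ["一", "啊"]  # sample characters chosen for this sketch

# Order by raw GBK bytes (what a purely byte-wise GBK comparison would see).
by_gbk_bytes = sorted(words, key=lambda s: s.encode("gbk"))

# Order by UTF-8 bytes (the order after converting GBK data to UTF8MB4).
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

for w in words:
    print(w, "gbk:", w.encode("gbk").hex(), "utf-8:", w.encode("utf-8").hex())
print("order by GBK bytes:  ", by_gbk_bytes)
print("order by UTF-8 bytes:", by_utf8_bytes)
```

The two orderings come out reversed for this pair, which is the kind of difference to expect between the two `gbk_bin` behaviors described above.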

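You can use the following statements to view the collations available for the character set (the results below are obtained under the [new collation framework](/character-set-and-collation.md#新框架下的排序规则支持)):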

```sql
SHOW CHARACTER SET WHERE CHARSET = 'gbk';
+---------+-------------------------------------+-------------------+--------+
| Charset | Description                         | Default collation | Maxlen |
+---------+-------------------------------------+-------------------+--------+
| gbk     | Chinese Internal Code Specification | gbk_chinese_ci    |      2 |
+---------+-------------------------------------+-------------------+--------+
1 row in set (0.00 sec)

SHOW COLLATION WHERE CHARSET = 'gbk';
+----------------+---------+------+---------+----------+---------+
| Collation      | Charset | Id   | Default | Compiled | Sortlen |
+----------------+---------+------+---------+----------+---------+
| gbk_bin        | gbk     |   87 |         | Yes      |       1 |
| gbk_chinese_ci | gbk     |   28 | Yes     | Yes      |       1 |
+----------------+---------+------+---------+----------+---------+
2 rows in set (0.00 sec)
```

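## MySQL compatibility

### Compatibility of illegal characters

* When `character_set_client = gbk`, `character_set_connection = utf8mb4`, and `character_set_table = gbk` are set, TiDB handles illegal characters in a way that is compatible with MySQL.
* After `names` is set to `gbk`, TiDB handles illegal characters somewhat differently from MySQL. The specific behavior is as follows (the table used by the `insert` statements is created with `create table gbk_table(a varchar(32) character set gbk);`):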

| DB | SQL_Mode | Compatible:<br>`character_set_client = gbk`<br>`character_set_connection = utf8mb4`<br>`character_set_table = gbk` | Incompatible:<br>`character_set_client = gbk`<br>`character_set_connection = gbk`<br>`character_set_table = gbk` |
|---------|----------------------|--------------------------------------|--------------------------------------|
| MySQL | `STRICT_ALL_TABLES` or `STRICT_TRANS_TABLES` | `select hex('一a')` (0xe4b88061)<br>`e6b6933f`<br><br>`insert into gbk_table values('一a');`<br>`select hex(a) from gbk_table;`<br>`e4b83f` | `select hex('一a')` (0xe4b88061)<br>`e4b88061`<br><br>`insert into gbk_table values('一a');`<br>Incorrect Error |
| TiDB | `STRICT_ALL_TABLES` or `STRICT_TRANS_TABLES` | `select hex('一a')` (0xe4b88061)<br>`e6b6933f`<br><br>`insert into gbk_table values('一a');`<br>`select hex(a) from gbk_table;`<br>`e4b83f` | `select hex('一a')` (0xe4b88061)<br>Incorrect Error<br><br>`insert into gbk_table values('一a');`<br>Incorrect Error |
| MySQL | Neither `STRICT_ALL_TABLES` nor `STRICT_TRANS_TABLES` | `select hex('一a')` (0xe4b88061)<br>`e6b6933f`<br><br>`insert into gbk_table values('一a');`<br>`select hex(a) from gbk_table;`<br>`e4b83f` | `select hex('一a')` (0xe4b88061)<br>`e4b88061`<br><br>`insert into gbk_table values('一a');`<br>`select hex(a) from gbk_table;`<br>`e4b8` |
| TiDB | Neither `STRICT_ALL_TABLES` nor `STRICT_TRANS_TABLES` | `select hex('一a')` (0xe4b88061)<br>`e6b6933f`<br><br>`insert into gbk_table values('一a');`<br>`select hex(a) from gbk_table;`<br>`e4b83f` | `select hex('一a')` (0xe4b88061)<br>`e4b83f`<br><br>`insert into gbk_table values('一a');`<br>`select hex(a) from gbk_table;`<br>`e4b83f` |
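
Part of the `e6b6933f` result in the compatible column can be reproduced outside the database. The sketch below is a hedged illustration that assumes Python's built-in `gbk` codec as a stand-in; it does not reflect how TiDB or MySQL are implemented, and the trailing `3f` (`?`) in the table is taken from the results above rather than computed by this code.

```python
# The client sends '一a' as UTF-8 bytes, but with character_set_client = gbk
# the server interprets those bytes as GBK.
raw = "一a".encode("utf-8")
print(raw.hex())  # e4b88061

# The first two bytes (0xe4 0xb8) happen to form a valid GBK character.
first = raw[:2].decode("gbk")
print(first, first.encode("utf-8").hex())

# The remaining bytes start with 0x80, which is not a valid GBK lead byte,
# so a strict GBK decode of the whole input fails.
try:
    raw.decode("gbk")
    whole_input_is_valid_gbk = True
except UnicodeDecodeError:
    whole_input_is_valid_gbk = False
print("valid GBK:", whole_input_is_valid_gbk)
```

Re-encoding the one character that decodes cleanly yields the `e6b693` prefix seen in the table; the bytes that cannot be decoded are where the strict and non-strict SQL modes diverge.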



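### Others

* Changing the character set type with the `alter table` statement is currently not supported.

* The `_gbk` character introducer is not supported. For example: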

```sql
create table t(a char(10) charset binary);
Query OK, 0 rows affected (0.00 sec)
insert into t values (_gbk'');
ERROR 1115 (42000): Unsupported character introducer: 'gbk'
```

2 changes: 1 addition & 1 deletion mysql-compatibility.md
@@ -29,7 +29,7 @@ aliases: ['/docs-cn/dev/mysql-compatibility/','/docs-cn/dev/reference/mysql-comp
  * User-defined functions
  * Foreign key constraints [#18209](https://github.com/pingcap/tidb/issues/18209)
  * FULLTEXT/SPATIAL functions and indexes [#1793](https://github.com/pingcap/tidb/issues/1793)
- * Character sets other than `ascii`/`latin1`/`binary`/`utf8`/`utf8mb4`
+ * Character sets other than `ascii`/`latin1`/`binary`/`utf8`/`utf8mb4`/`gbk`
  * SYS schema
  * MySQL trace optimizer
  * XML functions