title | summary |
---|---|
Migrate and Merge MySQL Shards of Large Datasets to TiDB Cloud |
Learn how to migrate and merge MySQL shards of large datasets to TiDB Cloud. |
This document describes how to migrate and merge a large MySQL dataset (for example, more than 1 TiB) from different partitions into TiDB Cloud. After full data migration, you can use TiDB Data Migration (DM) to perform incremental migration according to your business needs.
The example in this document uses a complex shard migration task across multiple MySQL instances, and involves handling conflicts in auto-increment primary keys. The scenario in this example is also applicable to merging data from different sharded tables within a single MySQL instance.
This section describes the basic information of the upstream cluster, DM, and downstream cluster used in the example.
The environment information of the upstream cluster is as follows:
-
MySQL version: MySQL v5.7.18
-
MySQL instance1:
- schema
store_01
and table[sale_01, sale_02]
- schema
store_02
and table[sale_01, sale_02]
- schema
-
MySQL instance 2:
- schema
store_01
and table[sale_01, sale_02]
- schema
store_02
and table[sale_01, sale_02]
- schema
-
Table structure:
CREATE TABLE sale_01 ( id bigint(20) NOT NULL auto_increment, uid varchar(40) NOT NULL, sale_num bigint DEFAULT NULL, PRIMARY KEY (id), UNIQUE KEY ind_uid (uid) );
The version of DM is v5.3.0. You need to deploy TiDB DM manually. For detailed steps, see Deploy a DM Cluster Using TiUP.
This document uses the Amazon S3 as an example.
The sharded schemas and tables are merged into the table store.sales
.
The following is the procedure to migrate and merge full data of MySQL shards to TiDB Cloud.
In the following example, you only need to export the data in tables to CSV format.
Create a first-level directory store
(corresponding to the level of databases) and a second-level directory sales
(corresponding to the level of tables) in the Amazon S3 bucket. In sales
, create a third-level directory for each MySQL instance (corresponding to the level of MySQL instances). For example:
- Migrate the data in MySQL instance1 to
s3://dumpling-s3/store/sales/instance01/
- Migrate the data in MySQL instance2 to
s3://dumpling-s3/store/sales/instance02/
If there are shards across multiple instances, you can create one first-level directory for each database and create one second-level directory for each sharded table. Then create a third-level directory for each MySQL instance for easy management. For example, if you want to migrate and merge tables stock_N.product_N
from MySQL instance1 and MySQL instance2 into the table stock.products
in TiDB Cloud, you can create the following directories:
s3://dumpling-s3/stock/products/instance01/
s3://dumpling-s3/stock/products/instance02/
For information about how to install Dumpling, see Dumpling Introduction.
When you use Dumpling to export data to Amazon S3, note the following:
- Enable binlog for upstream clusters.
- Choose the correct Amazon S3 directory and region.
- Choose the appropriate concurrency by configuring the
-t
option to minimize the impact on the upstream cluster, or export directly from the backup database. For more information about how to use this parameter, see Option list of Dumpling. - Set appropriate values for
--filetype csv
and--no-schemas
. For more information about how to use these parameters, see Option list of Dumpling.
Name the CSV files as follows:
- If the data of one table is separated into multiple CSV files, append a numeric suffix to these CSV files. For example,
${db_name}.${table_name}.000001.csv
and${db_name}.${table_name}.000002.csv
. The numeric suffixes can be inconsecutive but must be in ascending order. You also need to add extra zeros before the number to ensure all the suffixes are in the same length.
Note:
If you cannot update the CSV filenames according to the preceding rules in some cases (for example, the CSV file links are also used by your other programs), you can keep the filenames unchanged and use the Mapping Settings in Step 5 to import your source data to a single target table.
To export data to Amazon S3, do the following:
-
Get the
AWS_ACCESS_KEY_ID
andAWS_SECRET_ACCESS_KEY
of the Amazon S3 bucket.[root@localhost ~]# export AWS_ACCESS_KEY_ID={your_aws_access_key_id} [root@localhost ~]# export AWS_SECRET_ACCESS_KEY= {your_aws_secret_access_key}
-
Export data from MySQL instance1 to the
s3://dumpling-s3/store/sales/instance01/
directory in the Amazon S3 bucket.[root@localhost ~]# tiup dumpling -u {username} -p {password} -P {port} -h {mysql01-ip} -B store_01,store_02 -r 20000 --filetype csv --no-schemas -o "s3://dumpling-s3/store/sales/instance01/" --s3.region "ap-northeast-1"
For more information about the parameters, see Option list of Dumpling.
-
Export data from MySQL instance2 to the
s3://dumpling-s3/store/sales/instance02/
directory in the Amazon S3 bucket.[root@localhost ~]# tiup dumpling -u {username} -p {password} -P {port} -h {mysql02-ip} -B store_01,store_02 -r 20000 --filetype csv --no-schemas -o "s3://dumpling-s3/store/sales/instance02/" --s3.region "ap-northeast-1"
For detailed steps, see Export data to Amazon S3 cloud storage.
Create schemas in the TiDB Cloud cluster as follows:
mysql> CREATE DATABASE store;
Query OK, 0 rows affected (0.16 sec)
mysql> use store;
Database changed
In this example, the column IDs of the upstream tables sale_01
and sale_02
are auto-increment primary keys. Conflicts might occur when you merge sharded tables in the downstream database. Execute the following SQL statement to set the ID column as a normal index instead of a primary key:
mysql> CREATE TABLE `sales` (
`id` bigint(20) NOT NULL ,
`uid` varchar(40) NOT NULL,
`sale_num` bigint DEFAULT NULL,
INDEX (`id`),
UNIQUE KEY `ind_uid` (`uid`)
);
Query OK, 0 rows affected (0.17 sec)
For more information about the solutions to solve such conflicts, see Remove the PRIMARY KEY attribute from the column.
Follow the instructions in Configure Amazon S3 access to get the role ARN to access the source data.
The following example only lists key policy configurations. Replace the Amazon S3 path with your own values.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetObjectVersion"
],
"Resource": [
"arn:aws:s3:::dumpling-s3/*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": "arn:aws:s3:::dumpling-s3"
}
]
}
After configuring the Amazon S3 access, you can perform the data import task in the TiDB Cloud console as follows:
-
Open the Import page for your target cluster.
-
Log in to the TiDB Cloud console and navigate to the Clusters page of your project.
Tip:
If you have multiple projects, you can click in the lower-left corner and switch to another project.
-
Click the name of your target cluster to go to its overview page, and then click Import in the left navigation pane.
-
-
Select Import data from S3.
If this is your first time importing data into this cluster, select Import From Amazon S3.
-
On the Import Data from Amazon S3 page, fill in the following information:
- Import File Count: select Multiple files.
- Included Schema Files: select No.
- Data Format: select CSV.
- Folder URI: fill in the bucket URI of your source data. You can use the second-level directory corresponding to tables,
s3://dumpling-s3/store/sales/
in this example, so that TiDB Cloud can import and merge the data in all MySQL instances intostore.sales
in one go. - Bucket Access > AWS Role ARN: enter the Role-ARN you obtained.
If the location of the bucket is different from your cluster, confirm the compliance of cross region.
TiDB Cloud starts validating whether it can access your data in the specified bucket URI. After validation, TiDB Cloud tries to scan all the files in the data source using the default file naming pattern, and returns a scan summary result on the left side of the next page. If you get the
AccessDenied
error, see Troubleshoot Access Denied Errors during Data Import from S3. -
Click Connect.
-
In the Destination section, select the target database and table.
When importing multiple files, you can use Advanced Settings > Mapping Settings to define a custom mapping rule for each target table and its corresponding CSV file. After that, the data source files will be re-scanned using the provided custom mapping rule.
When you enter the source file URI and name in Source File URIs and Names, make sure it is in the following format
s3://[bucket_name]/[data_source_folder]/[file_name].csv
. For example,s3://sampledata/ingest/TableName.01.csv
.You can also use wildcards to match the source files. For example:
-
s3://[bucket_name]/[data_source_folder]/my-data?.csv
: all CSV files starting withmy-data
followed by one character (such asmy-data1.csv
andmy-data2.csv
) in that folder will be imported into the same target table. -
s3://[bucket_name]/[data_source_folder]/my-data*.csv
: all CSV files in the folder starting withmy-data
will be imported into the same target table.
Note that only
?
and*
are supported.Note:
The URI must contain the data source folder.
-
-
Edit the CSV configuration if needed.
You can also click Edit CSV configuration to configure Backslash Escape, Separator, and Delimiter for more fine-grained control.
Note:
For the configurations of separator, delimiter, and null, you can use both alphanumeric characters and certain special characters. The supported special characters include
\t
,\b
,\n
,\r
,\f
, and\u0001
. -
Click Start Import.
-
When the import progress shows Completed, check the imported tables.
After the data is imported, if you want to remove the Amazon S3 access of TiDB Cloud, simply delete the policy that you added.
To replicate the data changes based on binlog from a specified position in the upstream cluster to TiDB Cloud, you can use TiDB Data Migration (DM) to perform incremental replication.
If you want to migrate incremental data and merge MySQL shards to TiDB Cloud, you need to manually deploy TiDB DM, because TiDB Cloud does not support migrating and merging MySQL shards yet. For detailed steps, see Deploy a DM Cluster Using TiUP.
-
Create a new data source file
dm-source1.yaml
to configure an upstream data source into DM. Add the following content:# MySQL Configuration. source-id: "mysql-replica-01" # Specifies whether DM-worker pulls binlogs with GTID (Global Transaction Identifier). # The prerequisite is that you have already enabled GTID in the upstream MySQL. # If you have configured the upstream database service to switch master between different nodes automatically, you must enable GTID. enable-gtid: true from: host: "${host}" # For example: 192.168.10.101 user: "user01" password: "${password}" # Plaintext passwords are supported but not recommended. It is recommended that you use dmctl encrypt to encrypt plaintext passwords. port: ${port} # For example: 3307
-
Create another new data source file
dm-source2.yaml
, and add the following content:# MySQL Configuration. source-id: "mysql-replica-02" # Specifies whether DM-worker pulls binlogs with GTID (Global Transaction Identifier). # The prerequisite is that you have already enabled GTID in the upstream MySQL. # If you have configured the upstream database service to switch master between different nodes automatically, you must enable GTID. enable-gtid: true from: host: "192.168.10.102" user: "user02" password: "${password}" port: 3308
-
Run the following command in a terminal. Use
tiup dmctl
to load the first data source configuration into the DM cluster:[root@localhost ~]# tiup dmctl --master-addr ${advertise-addr} operate-source create dm-source1.yaml
The parameters used in the command above are described as follows:
Parameter Description --master-addr
The {advertise-addr}
of any DM-master node in the cluster wheredmctl
is to be connected. For example: 192.168.11.110:9261operate-source create
Loads the data source to the DM cluster. The following is an example output:
tiup is checking updates for component dmctl ... Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source1.yaml { "result": true, "msg": "", "sources": [ { "result": true, "msg": "", "source": "mysql-replica-01", "worker": "dm-192.168.11.111-9262" } ] }
-
Run the following command in a terminal. Use
tiup dmctl
to load the second data source configuration into the DM cluster:[root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source2.yaml
The following is an example output:
tiup is checking updates for component dmctl ... Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source2.yaml { "result": true, "msg": "", "sources": [ { "result": true, "msg": "", "source": "mysql-replica-02", "worker": "dm-192.168.11.112-9262" } ] }
-
Create a
test-task1.yaml
file for the replication task. -
Find the starting point in the metadata file of MySQL instance1 exported by Dumpling. For example:
Started dump at: 2022-05-25 10:16:26 SHOW MASTER STATUS: Log: mysql-bin.000002 Pos: 246546174 GTID:b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194801 Finished dump at: 2022-05-25 10:16:27
-
Find the starting point in the metadata file of MySQL instance2 exported by Dumpling. For example:
Started dump at: 2022-05-25 10:20:32 SHOW MASTER STATUS: Log: mysql-bin.000001 Pos: 1312659 GTID:cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4036 Finished dump at: 2022-05-25 10:20:32
-
Edit the task configuration file
test-task1
, to configure the incremental replication mode and replication starting point for each data source.## ********* Task Configuration ********* name: test-task1 shard-mode: "pessimistic" # Task mode. The "incremental" mode only performs incremental data migration. task-mode: incremental # timezone: "UTC" ## ******** Data Source Configuration ********** ## (Optional) If you need to incrementally replicate data that has already been migrated in the full data migration, you need to enable the safe mode to avoid the incremental data migration error. ## This scenario is common in the following case: the full migration data does not belong to the data source's consistency snapshot, and after that, DM starts to replicate incremental data from a position earlier than the full migration. syncers: # The running configurations of the sync processing unit. global: # Configuration name. safe-mode: false # # If this field is set to true, DM changes INSERT of the data source to REPLACE for the target database, # # and changes UPDATE of the data source to DELETE and REPLACE for the target database. # # This is to ensure that when the table schema contains a primary key or unique index, DML statements can be imported repeatedly. # # In the first minute of starting or resuming an incremental migration task, DM automatically enables the safe mode. mysql-instances: - source-id: "mysql-replica-01" block-allow-list: "bw-rule-1" route-rules: ["store-route-rule", "sale-route-rule"] filter-rules: ["store-filter-rule", "sale-filter-rule"] syncer-config-name: "global" meta: binlog-name: "mysql-bin.000002" binlog-pos: 246546174 binlog-gtid: "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194801" - source-id: "mysql-replica-02" block-allow-list: "bw-rule-1" route-rules: ["store-route-rule", "sale-route-rule"] filter-rules: ["store-filter-rule", "sale-filter-rule"] syncer-config-name: "global" meta: binlog-name: "mysql-bin.000001" binlog-pos: 1312659 binlog-gtid: "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4036" ## ******** Configuration of the target TiDB cluster on TiDB Cloud ********** target-database: # The target TiDB cluster on TiDB Cloud host: "tidb.xxxxxxx.xxxxxxxxx.ap-northeast-1.prod.aws.tidbcloud.com" port: 4000 user: "root" password: "${password}" # If the password is not empty, it is recommended to use a dmctl-encrypted cipher. ## ******** Function Configuration ********** routes: store-route-rule: schema-pattern: "store_*" target-schema: "store" sale-route-rule: schema-pattern: "store_*" table-pattern: "sale_*" target-schema: "store" target-table: "sales" filters: sale-filter-rule: schema-pattern: "store_*" table-pattern: "sale_*" events: ["truncate table", "drop table", "delete"] action: Ignore store-filter-rule: schema-pattern: "store_*" events: ["drop database"] action: Ignore block-allow-list: bw-rule-1: do-dbs: ["store_*"] ## ******** Ignore check items ********** ignore-checking-items: ["table_schema","auto_increment_ID"]
For detailed task configurations, see DM Task Configurations.
To run a data replication task smoothly, DM triggers a precheck automatically at the start of the task and returns the check results. DM starts the replication only after the precheck is passed. To trigger a precheck manually, run the check-task command:
[root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 check-task dm-task.yaml
The following is an example output:
tiup is checking updates for component dmctl ...
Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 check-task dm-task.yaml
{
"result": true,
"msg": "check pass!!!"
}
Use tiup dmctl
to run the following command to start the data replication task:
[root@localhost ~]# tiup dmctl --master-addr ${advertise-addr} start-task dm-task.yaml
The parameters used in the command above are described as follows:
Parameter | Description |
---|---|
--master-addr |
The {advertise-addr} of any DM-master node in the cluster where dmctl is to be connected. For example: 192.168.11.110:9261 |
start-task |
Starts the migration task. |
The following is an example output:
tiup is checking updates for component dmctl ...
Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 start-task dm-task.yaml
{
"result": true,
"msg": "",
"sources": [
{
"result": true,
"msg": "",
"source": "mysql-replica-01",
"worker": "dm-192.168.11.111-9262"
},
{
"result": true,
"msg": "",
"source": "mysql-replica-02",
"worker": "dm-192.168.11.112-9262"
}
],
"checkResult": ""
}
If the task fails to start, check the prompt message and fix the configuration. After that, you can re-run the command above to start the task.
If you encounter any problem, refer to DM error handling and DM FAQ.
To learn whether the DM cluster has an ongoing replication task and view the task status, run the query-status
command using tiup dmctl
:
[root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 query-status test-task1
The following is an example output:
{
"result": true,
"msg": "",
"sources": [
{
"result": true,
"msg": "",
"sourceStatus": {
"source": "mysql-replica-01",
"worker": "dm-192.168.11.111-9262",
"result": null,
"relayStatus": null
},
"subTaskStatus": [
{
"name": "test-task1",
"stage": "Running",
"unit": "Sync",
"result": null,
"unresolvedDDLLockID": "",
"sync": {
"totalEvents": "4048",
"totalTps": "3",
"recentTps": "3",
"masterBinlog": "(mysql-bin.000002, 246550002)",
"masterBinlogGtid": "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194813",
"syncerBinlog": "(mysql-bin.000002, 246550002)",
"syncerBinlogGtid": "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194813",
"blockingDDLs": [
],
"unresolvedGroups": [
],
"synced": true,
"binlogType": "remote",
"secondsBehindMaster": "0",
"blockDDLOwner": "",
"conflictMsg": ""
}
}
]
},
{
"result": true,
"msg": "",
"sourceStatus": {
"source": "mysql-replica-02",
"worker": "dm-192.168.11.112-9262",
"result": null,
"relayStatus": null
},
"subTaskStatus": [
{
"name": "test-task1",
"stage": "Running",
"unit": "Sync",
"result": null,
"unresolvedDDLLockID": "",
"sync": {
"totalEvents": "33",
"totalTps": "0",
"recentTps": "0",
"masterBinlog": "(mysql-bin.000001, 1316487)",
"masterBinlogGtid": "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4048",
"syncerBinlog": "(mysql-bin.000001, 1316487)",
"syncerBinlogGtid": "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4048",
"blockingDDLs": [
],
"unresolvedGroups": [
],
"synced": true,
"binlogType": "remote",
"secondsBehindMaster": "0",
"blockDDLOwner": "",
"conflictMsg": ""
}
}
]
}
]
}
For a detailed interpretation of the results, see Query Status.