-
Notifications
You must be signed in to change notification settings - Fork 2.3k
mysqlctl,vttablet,vtbackup: support restore tablet and vtbackup restore phase with CLONE #19084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add core infrastructure for MySQL CLONE REMOTE operations: - CloneExecutor in mysqlctl/clone.go for executing CLONE INSTANCE FROM - Clone user configuration in dbconfigs (--db-clone-user, --db-clone-password, --db-clone-use-ssl) - vt_clone user in init_db.sql with BACKUP_ADMIN privilege (MySQL 8.0.17+ only) - Clone plugin loading in mycnf configs (plugin-load-add = mysql_clone.so) - MySQLClonePluginFlavorCapability for version checking MySQL CLONE copies data at the physical level over the network, providing a faster alternative to logical backup/restore for replica provisioning. Requires MySQL 8.0.17+ and InnoDB-only tables. Signed-off-by: Nick Van Wiggeren <[email protected]>
Signed-off-by: Nick Van Wiggeren <[email protected]>
Signed-off-by: Nick Van Wiggeren <[email protected]>
Signed-off-by: Max Englander <[email protected]>
…o be marked as completed after the CLONE is done Signed-off-by: Nick Van Wiggeren <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Nick Van Wiggeren <[email protected]>
Signed-off-by: Nick Van Wiggeren <[email protected]>
Signed-off-by: Nick Van Wiggeren <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
6ee25db to
cf1a804
Compare
| // and returns whether a restore action should be performed | ||
| func ShouldRestore(ctx context.Context, params RestoreParams) (bool, error) { | ||
| if params.DeleteBeforeRestore || RestoreWasInterrupted(params.Cnf) { | ||
| func ShouldRestore(ctx context.Context, logger logutil.Logger, cnf *Mycnf, mysqld MysqlDaemon, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes allow this to be called from both tabletmanager.restoreFromClone() and tabletmanager.RestoreBackup() without the former needing to construct RestoreParams, which is otherwise not needed by restoreFromClone().
…9089) Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
mattlord
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! ❤️ Just some relatively minor comments/questions/requests. Please let me know what you think.
| // restoreFromBackup will just be a regular action | ||
| // (same as if it was triggered remotely) | ||
| if err := tm.RestoreData(ctx, logutil.NewConsoleLogger(), waitForBackupInterval, false /* deleteBeforeRestore */, backupTime, restoreToTimestamp, restoreToPos, restoreFromBackupAllowedEngines, mysqlShutdownTimeout); err != nil { | ||
| if err := tm.RestoreBackup(ctx, logutil.NewConsoleLogger(), waitForBackupInterval, false /* deleteBeforeRestore */, backupTime, restoreToTimestamp, restoreToPos, restoreFromBackupAllowedEngines, mysqlShutdownTimeout); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we pass in the logger from line 900 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure..it was already like this. want me to change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We didn't have the logger variable before. I think we should use it here too now that we have added it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i see, will do
| // Initialize shard primary. | ||
| err := localCluster.VtctldClientProcess.InitShardPrimary(keyspaceName, shardName, cell, primary.TabletUID) | ||
| require.NoError(t, err) | ||
|
|
||
| // Now check if MySQL version supports clone (need vttablet running to query). | ||
| if !mysqlVersionSupportsClone(t, primary) { | ||
| t.Skip("Skipping clone test: MySQL version does not support CLONE (requires 8.0.17+)") | ||
| } | ||
|
|
||
| // Check if clone plugin is available. | ||
| if !clonePluginAvailable(t, primary) { | ||
| t.Skip("Skipping clone test: clone plugin not available") | ||
| } | ||
|
|
||
| // Set up clean test data (table may have data from previous tests). | ||
| _, err = primary.VttabletProcess.QueryTablet(vtInsertTest, keyspaceName, true) | ||
| require.NoError(t, err) | ||
| _, err = primary.VttabletProcess.QueryTablet("TRUNCATE TABLE vt_insert_test", keyspaceName, true) | ||
| require.NoError(t, err) | ||
| _, err = primary.VttabletProcess.QueryTablet("insert into vt_insert_test (msg) values ('clone_test_1')", keyspaceName, true) | ||
| require.NoError(t, err) | ||
| _, err = primary.VttabletProcess.QueryTablet("insert into vt_insert_test (msg) values ('clone_test_2')", keyspaceName, true) | ||
| require.NoError(t, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this to be reliable, you have to explicitly disable vtorc using DisableVTOrcRecoveries(t). The tests fail locally for me, and I think it's because vtorc is running and competing with InitShardPrimary:
❯ go test -v -count=1 -timeout 30m vitess.io/vitess/go/test/endtoend/backup/clone
E0115 05:15:03.786892 68921 vtorc_process.go:113] configuration - {
"instance-poll-time": "10h"
}
=== RUN TestCloneBackup
Flag --init_db_sql_file has been deprecated, use --init-db-sql-file instead
Flag --init_keyspace has been deprecated, use --init-keyspace instead
Flag --allow_first_backup has been deprecated, use --allow-first-backup instead
Flag --mysql_socket has been deprecated, use --mysql-socket instead
--- PASS: TestCloneBackup (29.37s)
=== RUN TestCloneRestore
restore_test.go:65:
Error Trace: /Users/matt/git/vitess/go/test/endtoend/backup/clone/restore_test.go:65
Error: Received unexpected error:
The MySQL server is running with the --super-read-only option so it cannot execute this statement (errno 1290) (sqlstate HY000) during query: create table if not exists vt_insert_test (
id bigint auto_increment,
msg varchar(64),
primary key (id)
) Engine=InnoDB
Test: TestCloneRestore
E0115 05:15:47.389171 68921 vtctldclient_process.go:72] Output:
E0115 05:15:47.384561 72793 main.go:62] rpc error: code = Unknown desc = node doesn't exist: /zone1/tablets/zone1-0000006962/Tablet: while deleting 1 tablets; please inspect the topo
--- FAIL: TestCloneRestore (14.26s)
FAIL
FAIL vitess.io/vitess/go/test/endtoend/backup/clone 59.062s
FAIL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
| // Give replica2 time to catch up via replication. | ||
| time.Sleep(5 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These sorts of sleeps where we assume some state will be in place within a given timeframe have proven to lead to flakiness. Can we instead use require.Eventually using a query to check for the expected number of rows or for the replication state? I think we can just use require.Eventually with the VerifyRowsInTablet check or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making the change. require.Eventually still requires to pass in a duration to wait for. are you thinking i should set that to higher than 5 seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually set it well beyond what I would reasonably expect, since sometimes runners are resource starved and have long pauses etc and there's literally nothing we can do but re-run/re-try the workflow when the test fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok cool i set the wait time pretty high
| func TestCloneRestore(t *testing.T) { | ||
| t.Cleanup(func() { removeBackups(t) }) | ||
| t.Cleanup(tearDownRestoreTest) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note here about disable vtorc global recoveries so that it's not racing/competing with the InitShardPrimary call to set the primary.
| require.NoError(t, err) | ||
|
|
||
| // Wait for replica1 to catch up. | ||
| time.Sleep(2 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note here about using require.Eventually to check for the state we are only assuming we have after 2 seconds here.
|
|
||
| // Now check if MySQL version supports clone (need vttablet running to query). | ||
| if !mysqlVersionSupportsClone(t, primary) { | ||
| t.Skip("Skipping clone test: MySQL version does not support CLONE (requires 8.0.17+)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For both of the new e2e tests... I think that we should return an error if we are running in the CI (CI environment variable is set), and only skip otherwise. This prevents us from unknowingly not testing things in the CI for some odd reason.
| time.Sleep(5 * time.Second) | ||
|
|
||
| cluster.VerifyRowsInTablet(t, replica2, keyspaceName, 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note here about using require.Eventually to check of the rows.
| // verifyClonedData checks that the specific test data we inserted on primary | ||
| // exists on the cloned replica. This proves data was actually transferred. | ||
| func verifyClonedData(t *testing.T, tablet *cluster.Vttablet) { | ||
| qr, err := tablet.VttabletProcess.QueryTablet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the testing.T here so we could push the require.Eventually down here. Same thing for the other verify helpers here.
|
@Promptless can you please create a new suggestion for documenting this PR along with everything else recently added for the MySQL CLONE backup/restore feature? |
Signed-off-by: Max Englander <[email protected]>
| --log_dir string If non-empty, write log files in this directory | ||
| --logtostderr log to standard error instead of files | ||
| --max-stack-size int configure the maximum stack size in bytes (default 67108864) | ||
| --mysql-clone-enabled Enable MySQL CLONE plugin and user for backup/replica provisioning (requires MySQL 8.0.17+) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was added in an earlier PR, i don't think it's relevant for this binary
| --log_dir string If non-empty, write log files in this directory | ||
| --logtostderr log to standard error instead of files | ||
| --max-stack-size int configure the maximum stack size in bytes (default 67108864) | ||
| --mysql-clone-enabled Enable MySQL CLONE plugin and user for backup/replica provisioning (requires MySQL 8.0.17+) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as comment above
| utils.SetFlagStringVar(fs, &mycnfTemplateFile, "mysqlctl-mycnf-template", mycnfTemplateFile, "template file to use for generating the my.cnf file during server init") | ||
| utils.SetFlagStringVar(fs, &socketFile, "mysqlctl-socket", socketFile, "socket file to use for remote mysqlctl actions (empty for local actions)") | ||
| utils.SetFlagDurationVar(fs, &replicationConnectRetry, "replication-connect-retry", replicationConnectRetry, "how long to wait in between replica reconnect attempts. Only precise to the second.") | ||
| utils.SetFlagBoolVar(fs, &mysqlCloneEnabled, "mysql-clone-enabled", mysqlCloneEnabled, "Enable MySQL CLONE plugin and user for backup/replica provisioning (requires MySQL 8.0.17+)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo since this isn't used by mysqlctl or mysqlctld it shouldn't be grouped here.
| // Initialize shard primary. | ||
| err := localCluster.VtctldClientProcess.InitShardPrimary(keyspaceName, shardName, cell, primary.TabletUID) | ||
| require.NoError(t, err) | ||
|
|
||
| // Now check if MySQL version supports clone (need vttablet running to query). | ||
| if !mysqlVersionSupportsClone(t, primary) { | ||
| t.Skip("Skipping clone test: MySQL version does not support CLONE (requires 8.0.17+)") | ||
| } | ||
|
|
||
| // Check if clone plugin is available. | ||
| if !clonePluginAvailable(t, primary) { | ||
| t.Skip("Skipping clone test: clone plugin not available") | ||
| } | ||
|
|
||
| // Set up clean test data (table may have data from previous tests). | ||
| _, err = primary.VttabletProcess.QueryTablet(vtInsertTest, keyspaceName, true) | ||
| require.NoError(t, err) | ||
| _, err = primary.VttabletProcess.QueryTablet("TRUNCATE TABLE vt_insert_test", keyspaceName, true) | ||
| require.NoError(t, err) | ||
| _, err = primary.VttabletProcess.QueryTablet("insert into vt_insert_test (msg) values ('clone_test_1')", keyspaceName, true) | ||
| require.NoError(t, err) | ||
| _, err = primary.VttabletProcess.QueryTablet("insert into vt_insert_test (msg) values ('clone_test_2')", keyspaceName, true) | ||
| require.NoError(t, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
| // Give replica2 time to catch up via replication. | ||
| time.Sleep(5 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making the change. require.Eventually still requires to pass in a duration to wait for. are you thinking i should set that to higher than 5 seconds?
| func parseVersionFromRow(row []sqltypes.Value) (int, int, int, error) { | ||
| if len(row) == 0 { | ||
| return 0, 0, 0, errors.New("empty row") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this must have been leftover from an earlier iteration, it's unused. removing.
| } | ||
| } | ||
|
|
||
| // splitLines splits a string by newlines, filtering out empty lines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't know! switched to strings.SplitSeq and it's working fine, removing this.
| // restoreFromBackup will just be a regular action | ||
| // (same as if it was triggered remotely) | ||
| if err := tm.RestoreData(ctx, logutil.NewConsoleLogger(), waitForBackupInterval, false /* deleteBeforeRestore */, backupTime, restoreToTimestamp, restoreToPos, restoreFromBackupAllowedEngines, mysqlShutdownTimeout); err != nil { | ||
| if err := tm.RestoreBackup(ctx, logutil.NewConsoleLogger(), waitForBackupInterval, false /* deleteBeforeRestore */, backupTime, restoreToTimestamp, restoreToPos, restoreFromBackupAllowedEngines, mysqlShutdownTimeout); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure..it was already like this. want me to change it?
Signed-off-by: Max Englander <[email protected]>
Description
Next step after #19064. Adapted from @nickvanw code in private fork.
--restore-with-clone.--restore-with-cloneusestabletmanager.restoreFromClone().restoreFromClone()andRestoreBackup()use shared functions extracted fromRestoreBackup.Related Issue(s)
Checklist
Deployment Notes
AI Disclosure
AI was used.