migrate: re-run post-step callbacks on error #3
base: ll-fork
Conversation
Force-pushed ed3932c to 7599e4c
Nice! Version offset approach seems very clean to me! Nothing jumps out as being unviable so far.
Can we add unit tests to this PR which exercise the changes you've made? E.g. check the version at a failed callback, and check that the callback is executed once more after a failed migration.
General naming suggestion:
Instead of PostStepCallback, maybe we should consider using something shorter like task or hook. Just to reduce verbosity and simplify the naming as we add more to this functionality.
For example:
execTask(*Migration) vs executePostStepCallback
execTaskAtMigVersion(int) vs executePostStepCallbackForSQLMig
InTaskVersionRange(int) vs IsPostStepCallbackVersion
TaskVersionOffset vs PostStepCallbackOffset
If we already know that a task in this context is a callback which runs after a migration step, then we don't need to prefix everything with "PostStep".
We could even add new tasks.go and tasks_test.go files, for example, where we can document this feature better and keep our changes minimal in migrate.go, util.go, and migrate_test.go.
Force-pushed 7599e4c to 226c8d7
Thanks a lot for the review @ffranr 🙏!

I definitely agree that the name PostStepCallback is confusing, so I added some additional commits which rename it. I placed the main code for that struct/funcs in a new file.

Good idea :). I added a new file.
ffranr
left a comment
Thanks for the name changes, much easier to read now IMO!
I'm not sure about this: https://github.com/lightninglabs/migrate/pull/3/files#r2413777619
migrate.go
Outdated
	return m.unlockErr(err)
}

curVersion, dirty, err = m.databaseDrv.Version()
In this file, I see several new calls like curVersion, dirty, err = m.databaseDrv.Version(). We don’t appear to check the dirty value for these new calls. Should we add something like this here and in similar spots?
if dirty {
	return m.unlockErr(ErrDirty{curVersion})
}

If not, we should replace the unused return values with _ and add a brief comment explaining why it's safe to ignore them.
I wonder if we shouldn't extract the whole if InTaskVersionRange(curVersion) { block into a new function, since it's repeated ~5 times. Maybe we can call that function maybeExecTask or something like that.
I broke out much of the repeated logic + a little more into a new ensureCleanCurrentSQLVersion to address this feedback :). Hope you like it!
migrate.go
Outdated
// If the current version is a clean migration task version, then we
// need to rerun the task for the previous version before we can
// continue with any SQL migration(s).
Maybe we can add more to this comment.
// If the current version is a clean migration task version, then we
// need to rerun the task before we can
// continue with any SQL migration(s). This is because...
because the migration process ended cleanly but the task did not complete? We know that the task didn't complete because the version is a task version and not an SQL migrate version.
Is that correct?
Addressed.
We know that the task didn't complete because the version is a task version and not an SQL migrate version.
Correct, and specifically also that the database is set to a clean state + a migration task version. That can only ever happen if the migration task was run but returned an error.
	"finished for %v\n", migr.LogString())
err := m.execTask(migr)
if err != nil {
	return err
consider adding context message to this err using fmt.Errorf.
migrate.go
Outdated
task, ok := m.opts.tasks[migr.Version]
if ok {
I think we should return early here if !ok to save on indent. Maybe something like:
task, ok := m.opts.tasks[migr.Version]
if !ok {
	m.logVerbosePrintf("No migration task found for %v\n",
		migr.LogString())
	return nil
}
Good idea, thanks 🙏! Implemented :)
migrate.go
Outdated
|
|
// Persist that we are in the migration task phase for this
// version.
if err := m.databaseDrv.SetVersion(taskVersion, true); err != nil {
this line and another below are over our line limit. I think the CI isn't configured to run the project's linter right now?
I think we can run the linter as currently set up for this project using the command:
docker run --rm -v "$(pwd)":/app -w /app golangci/golangci-lint:v1.64.8 golangci-lint run --config .golangci.yml ./...
Doesn't seem to complain about line length, but gives output:
migrate.go:255:15: ineffectual assignment to dirty (ineffassign)
curVersion, dirty, err = m.databaseDrv.Version()
^
migrate.go:298:15: ineffectual assignment to dirty (ineffassign)
curVersion, dirty, err = m.databaseDrv.Version()
^
migrate.go:342:15: ineffectual assignment to dirty (ineffassign)
curVersion, dirty, err = m.databaseDrv.Version()
^
migrate.go:381:15: ineffectual assignment to dirty (ineffassign)
curVersion, dirty, err = m.databaseDrv.Version()
^
migrate.go:436:3: ineffectual assignment to curVersion (ineffassign)
curVersion, dirty, err = m.databaseDrv.Version()
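For reference, a small self-contained sketch of the two ways to silence these ineffassign warnings, consistent with the earlier review suggestion: either actually check dirty, or discard it with _ and document why that is safe (the version stub below is illustrative, not the migrate driver API):

```go
package main

import "fmt"

// version is a stub for m.databaseDrv.Version(); the returned values
// are illustrative.
func version() (int, bool, error) { return 42, false, nil }

func main() {
	// Option 1: read the dirty value, so the assignment is no
	// longer ineffectual.
	curVersion, dirty, err := version()
	if err != nil {
		panic(err)
	}
	if dirty {
		fmt.Printf("dirty at version %d\n", curVersion)
		return
	}
	fmt.Printf("clean at version %d\n", curVersion)

	// Option 2: explicitly discard dirty with _ and add a brief
	// comment explaining why ignoring it is safe at this call site.
	curVersion, _, err = version()
	if err != nil {
		panic(err)
	}
	fmt.Println(curVersion)
}
```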
I've enabled actions/workflows on the repo, so maybe CI will run now.
I've enabled actions/workflows on the repo, so maybe CI will run now.
Awesome, thanks 🎉! The CI workflow seems to run now 🔥
Also addressed the feedback of this comment with the introduction of the function mentioned in #3 (comment)
migrate.go
Outdated
err := task(migr, m.databaseDrv)
if err != nil {
	// Mark the database version as the taskVersion but in a
	// clean state, to indicate that the migration task
	// errored. We will therefore re-run the task on the
	// next migration run.
	if setErr := m.databaseDrv.SetVersion(taskVersion, false); setErr != nil {
Are we assuming here that the migration task rolls back any partial changes it made, so that if it returns an error we can still mark the state as “clean”? Or is that guaranteed behavior? If it’s just an assumption, can we please update the comment to clarify why the state is marked as “clean” even when the task fails?
In tapd code I see:

// ...the callback function that should be
// run _after_ the step was executed (but before the version is marked as
// cleanly executed). An error returned from the callback will cause the
// migration to fail and the step to be marked as dirty.
In tapd, I see that the task callback is executed within a database transaction (see makePostStepCallbacks in tapdb/post_migration_checks.go). Do we need something similar here? In other words, should we wrap the task callback in a single transaction enforced by the migrate package? Otherwise, could a task end up executing multiple migrations via m.databaseDrv, leading to non-atomic changes that we wouldn’t be able to roll back on error?
Are we assuming here that the migration task rolls back any partial changes it made, so that if it returns an error we can still mark the state as “clean”? Or is that guaranteed behavior? If it’s just an assumption, can we please update the comment to clarify why the state is marked as “clean” even when the task fails?
The reason we mark it as clean is that this is the definition of the migration task version behaviour: if the database version is currently set to a migration task version and clean, this is a guarantee that the migration task was executed with an error on the last attempt. If the database version is set to a migration task version in a dirty state, we cannot know for sure whether the migration task was previously executed successfully or not (hence requiring the manual intervention).
In tapd, I see that the task callback is executed within a database transaction (see makePostStepCallbacks in tapdb/post_migration_checks.go). Do we need something similar here?
I agree, and I do think in most scenarios it makes sense to execute the migration task in a single transaction, but I don't think that's always guaranteed to be the desired behaviour. For example:

When lnd uses this functionality to migrate from kvdb to sql through a migration task, it's likely that this will need to be done in multiple database transactions, as it would simply be too much data to migrate in a single database transaction. So ultimately I do think it should be up to the implementer of the migration task to define whether the migration task should be run in a single db transaction or not, and therefore I don't think this should be up to the implementation of the migrate library.

Additionally, I'm not sure that all database backends supported in the migrate library actually support transactions + rollbacks. Therefore I think we can't actually implement this in the migrate library, and it must be up to the caller to decide how to handle this.

Therefore, I didn't address the feedback of guaranteeing the execution of a migration task in a single db transaction through the migrate library. Do you agree with my reasoning 🙏?
migrate.go
Outdated
|
|
select {
case r = <-migRet:
case <-time.After(30 * time.Second):
I think we should add a const for this timeout. Something like:

const (
	// DefaultReadMigTimeout ...
	DefaultReadMigTimeout = 30 * time.Second
)
migrate.go
Outdated
// set clean state
if err = m.databaseDrv.SetVersion(migr.TargetVersion, false); err != nil {
Comment here needs enhancement.
migrate.go
Outdated
err = m.execTask(migr)
if err != nil {
	return err
}
add context to error here with fmt.Errorf?
Force-pushed 1133fc1 to b98dde9
GustavoStingelin
left a comment
This PR looks good in terms of implementation, but I’m still missing some context about the motivation behind it.
@ViktorT-11, could you share a bit more background or point to any previous discussion about the issue this aims to solve?
From what I understood, this might be related to migrations that go beyond schema changes and include data updates, which could partially complete and time out. In that case, re-running the migration would have less data to process and might succeed on a retry. If that’s the intent, I’d love to discuss if we could tackle this problem in other ways, for example:
- Making migrations more granular so each one processes smaller batches of data.
- Adjusting the default timeout or providing clearer guidance on how to configure it for larger nodes, or even making it dynamically configurable.
- Temporarily disabling constraints and using CONCURRENTLY where it's available.
So, at first glance, the timeout scenario seems to be the main one this change would address. Are there other use cases that motivated it?
@GustavoStingelin thanks again for looking through this PR as well 🙏!

Exactly. A migration task is a separate step that can execute arbitrary code, and it runs immediately after a specific SQL schema migration. This step is primarily intended to operate on the data in the database once we are certain the schema has a known structure.

To make this more concrete, here is a real-world example of when we need such a task: we are currently working on adding support for migrating the database backends in lnd from kvdb to SQL through a migration task.

The reason this kvdb-to-SQL migration must be performed as a migration task is that we need the SQL tables to exactly match the schema that existed when the kvdb-to-SQL migration code was written. If the schema has changed because of later SQL migrations that were added after the kvdb-to-SQL code was created, then the migration could fail due to incompatible table definitions. Therefore, the kvdb-to-SQL migration must run immediately after a specific SQL migration version (that is, as a migration task) and cannot be executed separately after all SQL schema migrations are complete. This explains why we need support for arbitrary migration tasks.

For the kvdb-to-SQL example, we also need the ability to re-run failed tasks on the next startup, which is what this PR adds. Since it is very difficult to anticipate every possible edge case in user data, some users' migrations may fail. If we do not support retrying the task on the next startup, those users would not be able to re-trigger the migration even after we provide a hotfix for the kvdb-to-SQL code. The only alternative would be to have them manually modify their SQL database, which is far less ideal than simply retrying the migration on every startup until it succeeds.

I hope this clarifies the motivation behind migration tasks and the added functionality to retry them on startup, which is what this PR introduces 🙏
Ok, thanks for sharing this context, concept ACK. I noticed that the tests are failing because we already have some migration files using versions like

I think option 2 is enough for now. It is not a permanent solution, but it gives plenty of time to revisit this later if needed.

To reproduce the pipe error more easily: test data:
Thanks for the really great feedback @GustavoStingelin 🙏!

I've raised an internal discussion of how we want to tackle this issue, as it's unfortunately not easy to solve unless we go for approach (1). The migration version is unfortunately being passed around as an

I'll update you when we've reached a conclusion on how we want to tackle this issue.
@ffranr: review reminder |
This commit modifies the migration framework to re-attempt, on the next run, migration tasks that error during a migration. Previously, if a migration task failed but its associated SQL migration succeeded, the database version would be set to a dirty state and require manual intervention in order to reset the SQL migration and re-attempt it plus the migration task. The new re-attempt mechanism is achieved by ensuring that a migration can only be either an SQL migration or a migration task, but not both. This way, if a migration task errors, the database version will be reset to the previous version prior to executing the migration task, and the migration task will be re-attempted on the next run.
Force-pushed b98dde9 to b2047a3
Updated the approach to achieve the re-run with the following logic: we now define a migration as either an SQL migration or a migration task, but not both. If a migration task errors, we will reset the database version to the version it was set to before attempting to execute the migration task. Therefore, the migration task will be re-executed once more on the next startup. That completely removes the need for the "migration task offset" approach.

I also suggest that we now rename the "migration task"/"post migration step" to a "code migration" instead, as it better explains what that migration type actually is now that it cannot also be an SQL migration. Reviewers, do you agree with that reasoning?
This commit modifies the migration framework to re-attempt post-step callbacks if they error during a migration, without leaving the database version as dirty.

Implementation approach:
This is achieved by introducing the concept of a "post-step callback" migration version. Post-step callbacks are their corresponding SQL migration version offset by +1000000000.
During the execution of a post-step callback, the post-step callback migration version will be persisted as the database version. That way, if the post-step callback errors, the version for the database will be the post-step callback version on the next startup. The post-step callback will then be re-attempted before proceeding with the next SQL migration.
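A small sketch of the version-offset scheme described above; only the +1000000000 offset comes from this description, and the helper names are illustrative:

```go
package main

import "fmt"

// taskVersionOffset is the offset described above; the constant name
// is illustrative.
const taskVersionOffset = 1000000000

// toTaskVersion maps an SQL migration version to its post-step
// callback version.
func toTaskVersion(sqlVersion int) int {
	return sqlVersion + taskVersionOffset
}

// isTaskVersion reports whether a persisted database version denotes
// a post-step callback rather than an SQL migration.
func isTaskVersion(v int) bool {
	return v >= taskVersionOffset
}

// toSQLVersion recovers the SQL migration version from a post-step
// callback version.
func toSQLVersion(taskVersion int) int {
	return taskVersion - taskVersionOffset
}

func main() {
	v := 17
	tv := toTaskVersion(v)
	fmt.Println(tv, isTaskVersion(tv), toSQLVersion(tv))
}
```

Note that this mapping assumes ordinary SQL migration versions stay below the offset; versions at or above it would be misclassified, which is the collision concern raised later in this conversation.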
Alternative approach:
Another solution to the issue would be to individually track whether a post migration has been executed successfully or not, in a separate version table for post migrations. I chose not to go for that approach, as it would require an individual implementation for every database backend type and would therefore be a much larger change.

I'm open to changing the approach though, if reviewers do not prefer the approach implemented by this PR.