
Conversation


@ViktorT-11 ViktorT-11 commented Sep 25, 2025

This commit modifies the migration framework to re-attempt post-step callbacks on the next run if they error during a migration, instead of leaving the database version dirty.

Implementation approach:
This is achieved by introducing the concept of a "post-step callback" migration version. A post-step callback's version is its corresponding SQL migration version offset by +1000000000.

During the execution of a post-step callback, the post-step callback migration version will be persisted as the database version. That way, if the post-step callback errors, the version for the database will be the post-step callback version on the next startup. The post-step callback will then be re-attempted before proceeding with the next SQL migration.
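As a rough sketch, the version-offset scheme described above could look like this in Go (the +1000000000 offset comes from this description; the function names are placeholders, not the PR's actual identifiers):

```go
package main

import "fmt"

// TaskVersionOffset mirrors the +1000000000 offset described in the
// PR description (illustrative constant name).
const TaskVersionOffset = 1000000000

// taskVersion maps an SQL migration version to the version that is
// persisted while its post-step callback runs.
func taskVersion(sqlVersion int) int {
	return sqlVersion + TaskVersionOffset
}

// inTaskVersionRange reports whether a persisted version denotes a
// post-step callback rather than an SQL migration.
func inTaskVersionRange(v int) bool {
	return v >= TaskVersionOffset
}

// sqlVersionFor recovers the originating SQL migration version from a
// post-step callback version.
func sqlVersionFor(taskV int) int {
	return taskV - TaskVersionOffset
}

func main() {
	fmt.Println(taskVersion(42))           // 1000000042
	fmt.Println(inTaskVersionRange(42))    // false
	fmt.Println(sqlVersionFor(1000000042)) // 42
}
```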

Alternative approach:
Another solution to the issue would be to individually track whether a post migration has executed successfully in a separate version table for post migrations. I chose not to go with that approach, as it would require an individual implementation for every database backend type and would therefore be a much larger change.

I'm open to changing the approach, though, if reviewers prefer a different one.

@ViktorT-11 ViktorT-11 force-pushed the 2025-09-rerun-post-step-callbacks-on-error branch from ed3932c to 7599e4c on September 25, 2025 09:53
@levmi levmi requested review from ellemouton and ffranr September 25, 2025 14:58

@ffranr ffranr left a comment

Nice! Version offset approach seems very clean to me! Nothing jumps out as being unviable so far.

Can we add unit tests to this PR which exercise the changes you've made? Check the version after a failed callback, check that the callback is executed once more after a failed migration, etc.

General naming suggestion:

Instead of PostStepCallback, maybe we should consider using something shorter like task or hook. Just to reduce verbosity and simplify the naming as we add more to this functionality.

For example:

  • execTask(*Migration) vs executePostStepCallback
  • execTaskAtMigVersion(int) vs executePostStepCallbackForSQLMig
  • InTaskVersionRange(int) vs IsPostStepCallbackVersion
  • TaskVersionOffset vs PostStepCallbackOffset

If we already know that a task in this context is a callback which runs after a migration step, then we don't need to prefix everything with "PostStep".

We could even add new tasks.go and tasks_test.go files, for example, where we can document this feature better and keep our changes minimal in migrate.go, util.go, and migrate_test.go.

@ViktorT-11 ViktorT-11 force-pushed the 2025-09-rerun-post-step-callbacks-on-error branch from 7599e4c to 226c8d7 on September 30, 2025 17:53
@ViktorT-11
Author

ViktorT-11 commented Sep 30, 2025

Thanks a lot for the review @ffranr 🙏!

Instead of PostStepCallback, maybe we should consider using something shorter like task

I definitely agree that the name PostStepCallback is confusing, so I added some additional commits that rename it to task/MigrationTask. Happy to address any feedback there if you still don't agree with my naming :).

I placed the main code for that struct/funcs in a new tasks.go file. I opted not to move the migration tests for the MigrationTask migration to a task_test.go file though, as I believe it makes more sense to keep all migration tests in the same file.

Can we add unit tests to this PR which exercise the changes you've made?

Good idea :). I added a new TestMigrationTaskError to the migrate_test.go file to address this!

@ffranr ffranr left a comment

Thanks for the name changes, much easier to read now IMO!

I'm not sure about this: https://github.com/lightninglabs/migrate/pull/3/files#r2413777619

migrate.go Outdated
return m.unlockErr(err)
}

curVersion, dirty, err = m.databaseDrv.Version()

In this file, I see several new calls like curVersion, dirty, err = m.databaseDrv.Version(). We don’t appear to check the dirty value for these new calls. Should we add something like this here and in similar spots?

if dirty {
	return m.unlockErr(ErrDirty{curVersion})
}

If not, we should replace the unused return values with _ and add a brief comment explaining why it’s safe to ignore them.

I wonder if we shouldn't extract the whole if InTaskVersionRange(curVersion) { block into a new function, since it's repeated ~5 times. Maybe we can call that function maybeExecTask or something like that.
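A minimal sketch of what such an extracted helper might look like, with stubbed-out driver types (only InTaskVersionRange, the dirty check, and ErrDirty come from the snippets in this thread; the real Migrate struct and driver interface differ):

```go
package main

import "fmt"

// TaskVersionOffset mirrors the +1000000000 offset from the PR
// description (illustrative).
const TaskVersionOffset = 1000000000

// driver is a minimal stand-in for the database driver; only the
// Version method is stubbed here.
type driver interface {
	Version() (version int, dirty bool, err error)
}

// ErrDirty mirrors the dirty-database error used in the thread.
type ErrDirty struct{ Version int }

func (e ErrDirty) Error() string {
	return fmt.Sprintf("database is dirty at version %d", e.Version)
}

func inTaskVersionRange(v int) bool { return v >= TaskVersionOffset }

// maybeExecTask centralizes the repeated pattern: read the version,
// fail on a dirty database, and re-run the task when the persisted
// version is a task version.
func maybeExecTask(d driver, runTask func(sqlVersion int) error) error {
	curVersion, dirty, err := d.Version()
	if err != nil {
		return err
	}
	if dirty {
		return ErrDirty{curVersion}
	}
	if !inTaskVersionRange(curVersion) {
		// Normal SQL migration version: nothing to re-run.
		return nil
	}
	// Re-attempt the task for the SQL version it belongs to.
	return runTask(curVersion - TaskVersionOffset)
}

type fakeDrv struct {
	v     int
	dirty bool
}

func (f fakeDrv) Version() (int, bool, error) { return f.v, f.dirty, nil }

func main() {
	err := maybeExecTask(fakeDrv{v: 1000000007}, func(sqlV int) error {
		fmt.Println("re-running task for SQL version", sqlV)
		return nil
	})
	fmt.Println("err:", err)
}
```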

Author

I broke out much of the repeated logic + a little more into a new ensureCleanCurrentSQLVersion function to address this feedback :). Hope you like it!

migrate.go Outdated
Comment on lines 244 to 246
// If the current version is a clean migration task version, then we
// need to rerun the task for the previous version before we can
// continue with any SQL migration(s).

Maybe we can add more to this comment.

	// If the current version is a clean migration task version, then we
	// need to rerun the task before we can
	// continue with any SQL migration(s). This is because... 

because the migration process ended cleanly but the task did not complete? We know that the task didn't complete because the version is a task version and not an SQL migrate version.

Is that correct?

Author

Addressed.

We know that the task didn't complete because the version is a task version and not an SQL migrate version.

Correct, and specifically also that the database is set to a clean state + a migration task version. That can only ever happen if the migration task was run but returned an error.

"finished for %v\n", migr.LogString())
err := m.execTask(migr)
if err != nil {
return err

Consider adding a context message to this err using fmt.Errorf.

migrate.go Outdated
Comment on lines 915 to 916
task, ok := m.opts.tasks[migr.Version]
if ok {

I think we should return early here if !ok to save on indent. Maybe something like:

task, ok := m.opts.tasks[migr.Version]
if !ok {
	m.logVerbosePrintf("No migration task found for %v\n",
		migr.LogString())
	return nil
}

Author

Good idea, thanks 🙏! Implemented :)

migrate.go Outdated

// Persist that we are in the migration task phase for this
// version.
if err := m.databaseDrv.SetVersion(taskVersion, true); err != nil {

this line and another below are over our line limit. I think the CI isn't configured to run the project's linter right now?

I think we can run the linter as currently setup for this project using command:

docker run --rm -v "$(pwd)":/app -w /app golangci/golangci-lint:v1.64.8 golangci-lint run --config .golangci.yml ./...

Doesn't seem to complain about line length, but gives output:

migrate.go:255:15: ineffectual assignment to dirty (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()
                            ^
migrate.go:298:15: ineffectual assignment to dirty (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()
                            ^
migrate.go:342:15: ineffectual assignment to dirty (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()
                            ^
migrate.go:381:15: ineffectual assignment to dirty (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()
                            ^
migrate.go:436:3: ineffectual assignment to curVersion (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()


I've enabled actions/workflows on the repo, so maybe CI will run now.

Author

I've enabled actions/workflows on the repo, so maybe CI will run now.

Awesome, thanks 🎉! The CI workflow seems to run now 🔥
Also addressed the feedback of this comment with the introduction of the function mentioned in #3 (comment)

migrate.go Outdated
Comment on lines 928 to 934
err := task(migr, m.databaseDrv)
if err != nil {
// Mark the database version as the taskVersion but in a
// clean state, to indicate that the migration task
// errored. We will therefore re-run the task on the
// next migration run.
if setErr := m.databaseDrv.SetVersion(taskVersion, false); setErr != nil {

Are we assuming here that the migration task rolls back any partial changes it made, so that if it returns an error we can still mark the state as “clean”? Or is that guaranteed behavior? If it’s just an assumption, can we please update the comment to clarify why the state is marked as “clean” even when the task fails?


In the TAP code I see:

// the callback function that should be
// run _after_ the step was executed (but before the version is marked as
// cleanly executed). An error returned from the callback will cause the
// migration to fail and the step to be marked as dirty.


In tapd, I see that the task callback is executed within a database transaction (see makePostStepCallbacks in tapdb/post_migration_checks.go). Do we need something similar here? In other words, should we wrap the task callback in a single transaction enforced by the migrate package? Otherwise, could a task end up executing multiple migrations via m.databaseDrv, leading to non-atomic changes that we wouldn’t be able to roll back on error?

@ViktorT-11 ViktorT-11 Oct 13, 2025

Are we assuming here that the migration task rolls back any partial changes it made, so that if it returns an error we can still mark the state as “clean”? Or is that guaranteed behavior? If it’s just an assumption, can we please update the comment to clarify why the state is marked as “clean” even when the task fails?

The reason we mark it as clean is that this is the definition of the migration task version behaviour: if the database version is currently set to a migration task version and is clean, that is a guarantee that the migration task errored on the last attempt. If the database version is set to a migration task version in a dirty state, we cannot know for sure whether the migration task was previously executed successfully or not (hence requiring the manual intervention).
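Those state semantics could be sketched as a small decision function (illustrative names and offset value, not the PR's code):

```go
package main

import "fmt"

// TaskVersionOffset is the illustrative +1000000000 offset from the
// PR description.
const TaskVersionOffset = 1000000000

// interpret maps a persisted (version, dirty) pair to the startup
// action, per the semantics described above.
func interpret(version int, dirty bool) string {
	isTask := version >= TaskVersionOffset
	switch {
	case isTask && !dirty:
		// Clean task version: the task ran and returned an
		// error, so re-attempt it before further migrations.
		return "re-run task"
	case isTask && dirty:
		// Dirty task version: we cannot know whether the task
		// completed, so require manual intervention.
		return "manual intervention"
	case dirty:
		// Dirty SQL version: the SQL migration itself failed.
		return "manual intervention"
	default:
		return "proceed with next SQL migration"
	}
}

func main() {
	fmt.Println(interpret(1000000007, false)) // re-run task
	fmt.Println(interpret(1000000007, true))  // manual intervention
	fmt.Println(interpret(7, false))          // proceed with next SQL migration
}
```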

In tapd, I see that the task callback is executed within a database transaction (see makePostStepCallbacks in tapdb/post_migration_checks.go). Do we need something similar here?

I agree, and I do think that in most scenarios it makes sense to execute the migration task in a single transaction, but I don't think that's always guaranteed to be the desired behaviour. For example: when lnd uses this functionality to migrate from kvdb to sql through a migration task, that will likely need to be done across multiple database transactions, as it would simply be too much data to migrate in a single database transaction. So ultimately I think it should be up to the implementer of the migration task to define whether the migration task should run in a single db transaction, and therefore I don't think this should be up to the implementation of the migrate library.
Additionally, I'm not sure that all database backends supported in the migrate library actually support transactions + rollbacks. Therefore I think we can't implement this in the migrate library, and it must be up to the caller to decide how to handle this.

Therefore, I didn't address the feedback of guaranteeing the execution of a migration task in a single db transaction through the migrate library. Do you agree with my reasoning 🙏?

migrate.go Outdated

select {
case r = <-migRet:
case <-time.After(30 * time.Second):

I think we should add a const for this timeout. Something like:

const (
    // DefaultReadMigTimeout ...
    DefaultReadMigTimeout = 30 * time.Second
)

migrate.go Outdated
Comment on lines 997 to 998
// set clean state
if err = m.databaseDrv.SetVersion(migr.TargetVersion, false); err != nil {

Comment here needs enhancement.

migrate.go Outdated
Comment on lines 989 to 941
err = m.execTask(migr)
if err != nil {
return err
}

Add context to the error here with fmt.Errorf?

@ViktorT-11 ViktorT-11 force-pushed the 2025-09-rerun-post-step-callbacks-on-error branch 2 times, most recently from 1133fc1 to b98dde9 on October 13, 2025 23:38
@ViktorT-11 ViktorT-11 requested a review from ffranr October 13, 2025 23:49

@GustavoStingelin GustavoStingelin left a comment

This PR looks good in terms of implementation, but I’m still missing some context about the motivation behind it.

@ViktorT-11, could you share a bit more background or point to any previous discussion about the issue this aims to solve?

From what I understood, this might be related to migrations that go beyond schema changes and include data updates, which could partially complete and time out. In that case, re-running the migration would have less data to process and might succeed on a retry. If that’s the intent, I’d love to discuss if we could tackle this problem in other ways, for example:

  • Making migrations more granular so each one processes smaller batches of data.
  • Adjusting the default timeout or providing clearer guidance on how to configure it for larger nodes, or even making it dynamically configurable.
  • Temporarily disabling constraints and using CONCURRENTLY where it’s available.

So, at first glance, the timeout scenario seems to be the main one this change would address. Are there other use cases that motivated it?

@ViktorT-11
Author

@GustavoStingelin thanks again for looking through this PR as well 🙏!

From what I understood, this might be related to migrations that go beyond schema changes and include data updates

Exactly. A migration task is a separate step that can execute arbitrary code, and it runs immediately after a specific SQL schema migration. This step is primarily intended to operate on the data in the database once we are certain the schema has a known structure.

To make this more concrete, here is a real-world example of when we need such a task:

We are currently working on adding support for migrating the database backends in lnd and litd from bbolt to SQL. This involves iterating over every entry in the kvdb (bbolt) and inserting the equivalent row into the SQL database.

The reason this kvdb-to-SQL migration must be performed as a migration task is that we need the SQL tables to exactly match the schema that existed when the kvdb-to-SQL migration code was written. If the schema has changed because of later SQL migrations that were added after the kvdb-to-SQL code was created, then the migration could fail due to incompatible table definitions.

Therefore, the kvdb-to-SQL migration must run immediately after a specific SQL migration version (that is, as a migration task) and cannot be executed separately after all SQL schema migrations are complete.

This explains why we need support for arbitrary migration tasks. For the kvdb-to-SQL example, we also need the ability to re-run failed tasks on the next startup, which is what this PR adds. Since it is very difficult to anticipate every possible edge case in user data for lnd and litd, it is likely that some users will see migration errors during the kvdb-to-SQL migration (i.e. the migration task errors).

If we do not support retrying the task on the next startup, those users would not be able to re-trigger the migration even after we provide a hotfix for the kvdb-to-SQL code. The only alternative would be to have them manually modify their SQL database, which is far less ideal than simply retrying the migration on every startup until it succeeds.

I hope this clarifies the motivation behind migration tasks and the added functionality to retry them on startup, which is what this PR introduces 🙏

@GustavoStingelin

Ok, thanks for sharing this context, concept ACK.

I noticed that the tests are failing because we already have some migration files using versions like 1885849751, which seem to represent timestamps. To fix this, I see two possible approaches:

  1. Rename the existing versions to be below 1000000000.
    This would remove the timestamp-style versions, but slightly break compatibility with future upstream migrations from the official repository.

  2. Increase TaskVersionOffset to 4000000000.
    This gives us a valid version space until around October 2, 2096, leaving the remaining 294967295 values available for contiguous migration versions (since Migration.version uses uint).

I think option 2 is enough for now. It is not a permanent solution, but it gives plenty of time to revisit this later if needed.

To reproduce the pipeline error more easily:

$ go test ./database/postgres

  --- FAIL: Test (59.11s)
   --- FAIL: Test/testMigrate (0.00s)
       --- FAIL: Test/testMigrate/postgres:13 (3.47s)
            migrate_testing.go:30: UP
           migrate_testing.go:32: migration version 1085649617 is invalid, must be < 1000000000

test data: migrate/database/postgres/examples/migrations

@ViktorT-11 ViktorT-11 requested review from ziggie1984 and removed request for ellemouton October 29, 2025 10:40
@ViktorT-11
Author

Thanks for the really great feedback @GustavoStingelin 🙏!

I've raised an internal discussion of how we want to tackle this issue, as it's unfortunately not easy to solve unless we go for approach (1). The migration version is unfortunately passed around as an int32 on 32-bit OS builds, and with the offset approach added by this PR, the number of values available above the offset needs to be at least the number of values below it. The maximum offset we can use is therefore the max value of int32 / 2, so approach (2) is not really viable either, unfortunately...

I'll update you when we've reached a conclusion of how we want to tackle this issue.

@lightninglabs-deploy

@ffranr: review reminder
@ziggie1984: review reminder
@ViktorT-11, remember to re-request review from reviewers when ready

This commit modifies the migration framework to re-attempt migration
tasks on the next run if they error during a migration. Previously, if
a migration task failed but its associated SQL migration succeeded,
the database version would be set to a dirty state, requiring manual
intervention to reset the SQL migration and re-attempt it plus the
migration task.

The new re-attempt mechanism is achieved by ensuring that a migration
can only be either an SQL migration or a migration task, but not both.
This way, if a migration task errors, the database version will be
reset to the previous version prior to executing the migration task, and
the migration task will be re-attempted on the next run.
@ViktorT-11 ViktorT-11 force-pushed the 2025-09-rerun-post-step-callbacks-on-error branch from b98dde9 to b2047a3 on November 13, 2025 01:24
@ViktorT-11
Author

Updated the approach used to achieve the re-run to the following logic:

We now define a migration such that it can be either an SQL migration or a migration task, but not both.

If a migration task errors, we reset the database version to the version it was set to before attempting to execute the migration task. The migration task will therefore be re-executed on the next startup.

That completely removes the need for the "migration task offset" approach.

I also suggest that we now rename the "Migration task"/"post migration step" to a "code migration" instead, as it better explains what that migration type actually is now that it cannot also be an SQL migration. Reviewers, do you agree with that reasoning?

Labels

enhancement New feature or request
