
Conversation


@ViktorT-11 ViktorT-11 commented Sep 25, 2025

This commit modifies the migration framework to re-attempt post-step callbacks on the next run if they error during a migration, instead of leaving the database version dirty.

Implementation approach:
This is achieved by introducing the concept of a "post-step callback" migration version. A post-step callback's version is its corresponding SQL migration version offset by +1000000000.

During the execution of a post-step callback, the post-step callback migration version will be persisted as the database version. That way, if the post-step callback errors, the version for the database will be the post-step callback version on the next startup. The post-step callback will then be re-attempted before proceeding with the next SQL migration.
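As a rough sketch, the version-offset scheme described above could look like this in Go (the +1000000000 offset comes from this description; the function names are placeholders, not the PR's actual identifiers):

```go
package main

import "fmt"

// TaskVersionOffset mirrors the +1000000000 offset described in the
// PR description (illustrative constant name).
const TaskVersionOffset = 1000000000

// taskVersion maps an SQL migration version to the version that is
// persisted while its post-step callback runs.
func taskVersion(sqlVersion int) int {
	return sqlVersion + TaskVersionOffset
}

// inTaskVersionRange reports whether a persisted version denotes a
// post-step callback rather than an SQL migration.
func inTaskVersionRange(v int) bool {
	return v >= TaskVersionOffset
}

// sqlVersionFor recovers the originating SQL migration version from a
// post-step callback version.
func sqlVersionFor(taskV int) int {
	return taskV - TaskVersionOffset
}

func main() {
	fmt.Println(taskVersion(42))           // 1000000042
	fmt.Println(inTaskVersionRange(42))    // false
	fmt.Println(sqlVersionFor(1000000042)) // 42
}
```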

Alternative approach:
Another solution to the issue would be to individually track whether a post migration has executed successfully in a separate version table for post migrations. I chose not to go with that approach, as it would require an individual implementation for every database backend type and would therefore be a much larger change.

I'm open to changing the approach, though, if reviewers prefer a different one.

@ViktorT-11 ViktorT-11 force-pushed the 2025-09-rerun-post-step-callbacks-on-error branch from ed3932c to 7599e4c on September 25, 2025 09:53
@levmi levmi requested review from ellemouton and ffranr September 25, 2025 14:58

@ffranr ffranr left a comment

Nice! Version offset approach seems very clean to me! Nothing jumps out as being unviable so far.

Can we add unit tests to this PR which exercise the changes you've made? Check the version after a failed callback, check that the callback is executed once more after a failed migration, etc.

General naming suggestion:

Instead of PostStepCallback, maybe we should consider using something shorter like task or hook. Just to reduce verbosity and simplify the naming as we add more to this functionality.

For example:

  • execTask(*Migration) vs executePostStepCallback
  • execTaskAtMigVersion(int) vs executePostStepCallbackForSQLMig
  • InTaskVersionRange(int) vs IsPostStepCallbackVersion
  • TaskVersionOffset vs PostStepCallbackOffset

If we already know that a task in this context is a callback which runs after a migration step, then we don't need to prefix everything with "PostStep".

We could even add new tasks.go and tasks_test.go files, for example, where we can document this feature better and keep our changes minimal in migrate.go, util.go, and migrate_test.go.

@ViktorT-11 ViktorT-11 force-pushed the 2025-09-rerun-post-step-callbacks-on-error branch from 7599e4c to 226c8d7 on September 30, 2025 17:53
@ViktorT-11
Author

ViktorT-11 commented Sep 30, 2025

Thanks a lot for the review @ffranr 🙏!

Instead of PostStepCallback, maybe we should consider using something shorter like task

I definitely agree that the name PostStepCallback is confusing, so I added some additional commits that rename it to task/MigrationTask. Happy to address any feedback there if you still don't agree with my naming :).

I placed the main code for that struct/funcs in a new tasks.go file. I opted not to move the migration tests for the MigrationTask migration to a task_test.go file though, as I believe it makes more sense to keep all migration tests in the same file.

Can we add unit tests to this PR which exercise the changes you've made?

Good idea :). I added a new TestMigrationTaskError to the migrate_test.go file to address this!

@ffranr ffranr left a comment

Thanks for the name changes, much easier to read now IMO!

I'm not sure about this: https://github.com/lightninglabs/migrate/pull/3/files#r2413777619

migrate.go Outdated
return m.unlockErr(err)
}

curVersion, dirty, err = m.databaseDrv.Version()

In this file, I see several new calls like curVersion, dirty, err = m.databaseDrv.Version(). We don’t appear to check the dirty value for these new calls. Should we add something like this here and in similar spots?

if dirty {
	return m.unlockErr(ErrDirty{curVersion})
}

If not, we should replace the unused return values with _ and add a brief comment explaining why it’s safe to ignore them.

I wonder if we shouldn't extract the whole if InTaskVersionRange(curVersion) { block into a new function, since it's repeated ~5 times. Maybe we can call that function maybeExecTask or something like that.
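A minimal sketch of what such an extracted helper might look like, with stubbed-out driver types (only InTaskVersionRange, the dirty check, and ErrDirty come from the snippets in this thread; the real Migrate struct and driver interface differ):

```go
package main

import "fmt"

// TaskVersionOffset mirrors the +1000000000 offset from the PR
// description (illustrative).
const TaskVersionOffset = 1000000000

// driver is a minimal stand-in for the database driver; only the
// Version method is stubbed here.
type driver interface {
	Version() (version int, dirty bool, err error)
}

// ErrDirty mirrors the dirty-database error used in the thread.
type ErrDirty struct{ Version int }

func (e ErrDirty) Error() string {
	return fmt.Sprintf("database is dirty at version %d", e.Version)
}

func inTaskVersionRange(v int) bool { return v >= TaskVersionOffset }

// maybeExecTask centralizes the repeated pattern: read the version,
// fail on a dirty database, and re-run the task when the persisted
// version is a task version.
func maybeExecTask(d driver, runTask func(sqlVersion int) error) error {
	curVersion, dirty, err := d.Version()
	if err != nil {
		return err
	}
	if dirty {
		return ErrDirty{curVersion}
	}
	if !inTaskVersionRange(curVersion) {
		// Normal SQL migration version: nothing to re-run.
		return nil
	}
	// Re-attempt the task for the SQL version it belongs to.
	return runTask(curVersion - TaskVersionOffset)
}

type fakeDrv struct {
	v     int
	dirty bool
}

func (f fakeDrv) Version() (int, bool, error) { return f.v, f.dirty, nil }

func main() {
	err := maybeExecTask(fakeDrv{v: 1000000007}, func(sqlV int) error {
		fmt.Println("re-running task for SQL version", sqlV)
		return nil
	})
	fmt.Println("err:", err)
}
```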

Author

I broke out much of the repeated logic + a little more into a new ensureCleanCurrentSQLVersion function to address this feedback :). Hope you like it!

migrate.go Outdated
Comment on lines 244 to 246
// If the current version is a clean migration task version, then we
// need to rerun the task for the previous version before we can
// continue with any SQL migration(s).

Maybe we can add more to this comment.

	// If the current version is a clean migration task version, then we
	// need to rerun the task before we can
	// continue with any SQL migration(s). This is because... 

because the migration process ended cleanly but the task did not complete? We know that the task didn't complete because the version is a task version and not an SQL migrate version.

Is that correct?

Author

Addressed.

We know that the task didn't complete because the version is a task version and not an SQL migrate version.

Correct, and specifically also that the database is set to a clean state + a migration task version. That can only ever happen if the migration task was run but returned an error.

"finished for %v\n", migr.LogString())
err := m.execTask(migr)
if err != nil {
return err

Consider adding a context message to this err using fmt.Errorf.

migrate.go Outdated
Comment on lines 915 to 916
task, ok := m.opts.tasks[migr.Version]
if ok {

I think we should return early here if !ok to save on indent. Maybe something like:

task, ok := m.opts.tasks[migr.Version]
if !ok {
	m.logVerbosePrintf("No migration task found for %v\n",
		migr.LogString())
	return nil
}

Author

Good idea, thanks 🙏! Implemented :)

migrate.go Outdated

// Persist that we are in the migration task phase for this
// version.
if err := m.databaseDrv.SetVersion(taskVersion, true); err != nil {

this line and another below are over our line limit. I think the CI isn't configured to run the project's linter right now?

I think we can run the linter as currently setup for this project using command:

docker run --rm -v "$(pwd)":/app -w /app golangci/golangci-lint:v1.64.8 golangci-lint run --config .golangci.yml ./...

Doesn't seem to complain about line length, but gives output:

migrate.go:255:15: ineffectual assignment to dirty (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()
                            ^
migrate.go:298:15: ineffectual assignment to dirty (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()
                            ^
migrate.go:342:15: ineffectual assignment to dirty (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()
                            ^
migrate.go:381:15: ineffectual assignment to dirty (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()
                            ^
migrate.go:436:3: ineffectual assignment to curVersion (ineffassign)
                curVersion, dirty, err = m.databaseDrv.Version()


I've enabled actions/workflows on the repo, so maybe CI will run now.

Author

I've enabled actions/workflows on the repo, so maybe CI will run now.

Awesome, thanks 🎉! The CI workflow seems to run now 🔥
Also addressed the feedback of this comment with the introduction of the function mentioned in #3 (comment)

migrate.go Outdated
Comment on lines 928 to 934
err := task(migr, m.databaseDrv)
if err != nil {
// Mark the database version as the taskVersion but in a
// clean state, to indicate that the migration task
// errored. We will therefore re-run the task on the
// next migration run.
if setErr := m.databaseDrv.SetVersion(taskVersion, false); setErr != nil {

Are we assuming here that the migration task rolls back any partial changes it made, so that if it returns an error we can still mark the state as “clean”? Or is that guaranteed behavior? If it’s just an assumption, can we please update the comment to clarify why the state is marked as “clean” even when the task fails?


In the TAP code I see:

// the callback function that should be
// run _after_ the step was executed (but before the version is marked as
// cleanly executed). An error returned from the callback will cause the
// migration to fail and the step to be marked as dirty.


In tapd, I see that the task callback is executed within a database transaction (see makePostStepCallbacks in tapdb/post_migration_checks.go). Do we need something similar here? In other words, should we wrap the task callback in a single transaction enforced by the migrate package? Otherwise, could a task end up executing multiple migrations via m.databaseDrv, leading to non-atomic changes that we wouldn’t be able to roll back on error?

@ViktorT-11 ViktorT-11 Oct 13, 2025

Are we assuming here that the migration task rolls back any partial changes it made, so that if it returns an error we can still mark the state as “clean”? Or is that guaranteed behavior? If it’s just an assumption, can we please update the comment to clarify why the state is marked as “clean” even when the task fails?

The reason we mark it as clean is that this is the definition of the migration task version behaviour: if the database version is currently set to a migration task version and is clean, that is a guarantee that the migration task errored on the last attempt. If the database version is set to a migration task version in a dirty state, we cannot know for sure whether the migration task was previously executed successfully or not (hence requiring the manual intervention).
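Those state semantics could be sketched as a small decision function (illustrative names and offset value, not the PR's code):

```go
package main

import "fmt"

// TaskVersionOffset is the illustrative +1000000000 offset from the
// PR description.
const TaskVersionOffset = 1000000000

// interpret maps a persisted (version, dirty) pair to the startup
// action, per the semantics described above.
func interpret(version int, dirty bool) string {
	isTask := version >= TaskVersionOffset
	switch {
	case isTask && !dirty:
		// Clean task version: the task ran and returned an
		// error, so re-attempt it before further migrations.
		return "re-run task"
	case isTask && dirty:
		// Dirty task version: we cannot know whether the task
		// completed, so require manual intervention.
		return "manual intervention"
	case dirty:
		// Dirty SQL version: the SQL migration itself failed.
		return "manual intervention"
	default:
		return "proceed with next SQL migration"
	}
}

func main() {
	fmt.Println(interpret(1000000007, false)) // re-run task
	fmt.Println(interpret(1000000007, true))  // manual intervention
	fmt.Println(interpret(7, false))          // proceed with next SQL migration
}
```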

In tapd, I see that the task callback is executed within a database transaction (see makePostStepCallbacks in tapdb/post_migration_checks.go). Do we need something similar here?

I agree, and I do think that in most scenarios it makes sense to execute the migration task in a single transaction, but I don't think that's always guaranteed to be the desired behaviour. For example: when lnd uses this functionality to migrate from kvdb to sql through a migration task, that will likely need to be done across multiple database transactions, as it would simply be too much data to migrate in a single database transaction. So ultimately I think it should be up to the implementer of the migration task to define whether the migration task should run in a single db transaction, and therefore I don't think this should be up to the implementation of the migrate library.
Additionally, I'm not sure that all database backends supported in the migrate library actually support transactions + rollbacks. Therefore I think we can't implement this in the migrate library, and it must be up to the caller to decide how to handle this.

Therefore, I didn't address the feedback of guaranteeing the execution of a migration task in a single db transaction through the migrate library. Do you agree with my reasoning 🙏?

migrate.go Outdated

select {
case r = <-migRet:
case <-time.After(30 * time.Second):

I think we should add a const for this timeout. Something like:

const (
    // DefaultReadMigTimeout ...
    DefaultReadMigTimeout = 30 * time.Second
)

migrate.go Outdated
Comment on lines 997 to 998
// set clean state
if err = m.databaseDrv.SetVersion(migr.TargetVersion, false); err != nil {

Comment here needs enhancement.

migrate.go Outdated
Comment on lines 989 to 941
err = m.execTask(migr)
if err != nil {
return err
}

Add context to the error here with fmt.Errorf?

@ViktorT-11 ViktorT-11 force-pushed the 2025-09-rerun-post-step-callbacks-on-error branch 2 times, most recently from 1133fc1 to b98dde9 on October 13, 2025 23:38
@ViktorT-11 ViktorT-11 requested a review from ffranr October 13, 2025 23:49

@GustavoStingelin GustavoStingelin left a comment

This PR looks good in terms of implementation, but I’m still missing some context about the motivation behind it.

@ViktorT-11, could you share a bit more background or point to any previous discussion about the issue this aims to solve?

From what I understood, this might be related to migrations that go beyond schema changes and include data updates, which could partially complete and time out. In that case, re-running the migration would have less data to process and might succeed on a retry. If that’s the intent, I’d love to discuss if we could tackle this problem in other ways, for example:

  • Making migrations more granular so each one processes smaller batches of data.
  • Adjusting the default timeout or providing clearer guidance on how to configure it for larger nodes, or even making it dynamically configurable.
  • Temporarily disabling constraints and using CONCURRENTLY where it’s available.

So, at first glance, the timeout scenario seems to be the main one this change would address. Are there other use cases that motivated it?

@ViktorT-11
Author

@GustavoStingelin thanks again for looking through this PR as well 🙏!

From what I understood, this might be related to migrations that go beyond schema changes and include data updates

Exactly. A migration task is a separate step that can execute arbitrary code, and it runs immediately after a specific SQL schema migration. This step is primarily intended to operate on the data in the database once we are certain the schema has a known structure.

To make this more concrete, here is a real-world example of when we need such a task:

We are currently working on adding support for migrating the database backends in lnd and litd from bbolt to SQL. This involves iterating over every entry in the kvdb (bbolt) and inserting the equivalent row into the SQL database.

The reason this kvdb-to-SQL migration must be performed as a migration task is that we need the SQL tables to exactly match the schema that existed when the kvdb-to-SQL migration code was written. If the schema has changed because of later SQL migrations that were added after the kvdb-to-SQL code was created, then the migration could fail due to incompatible table definitions.

Therefore, the kvdb-to-SQL migration must run immediately after a specific SQL migration version (that is, as a migration task) and cannot be executed separately after all SQL schema migrations are complete.

This explains why we need support for arbitrary migration tasks. For the kvdb-to-SQL example, we also need the ability to re-run failed tasks on the next startup, which is what this PR adds. Since it is very difficult to anticipate every possible edge case in user data for lnd and litd, it is likely that some users will see migration errors during the kvdb-to-SQL migration (i.e. the migration task errors).

If we do not support retrying the task on the next startup, those users would not be able to re-trigger the migration even after we provide a hotfix for the kvdb-to-SQL code. The only alternative would be to have them manually modify their SQL database, which is far less ideal than simply retrying the migration on every startup until it succeeds.

I hope this clarifies the motivation behind migration tasks and the added functionality to retry them on startup, which is what this PR introduces 🙏

@GustavoStingelin

Ok, thanks for sharing this context, concept ACK.

I noticed that the tests are failing because we already have some migration files using versions like 1885849751, which seem to represent timestamps. To fix this, I see two possible approaches:

  1. Rename the existing versions to be below 1000000000.
    This would remove the timestamp-style versions, but slightly break compatibility with future upstream migrations from the official repository.

  2. Increase TaskVersionOffset to 4000000000.
    This gives us a valid version space until around October 2, 2096, leaving the remaining 294967295 values available for contiguous migration versions (since Migration.version uses uint).

I think option 2 is enough for now. It is not a permanent solution, but it gives plenty of time to revisit this later if needed.

To reproduce the pipeline error more easily:

$ go test ./database/postgres

  --- FAIL: Test (59.11s)
   --- FAIL: Test/testMigrate (0.00s)
       --- FAIL: Test/testMigrate/postgres:13 (3.47s)
            migrate_testing.go:30: UP
           migrate_testing.go:32: migration version 1085649617 is invalid, must be < 1000000000

test data: migrate/database/postgres/examples/migrations

@ViktorT-11 ViktorT-11 requested review from ziggie1984 and removed request for ellemouton October 29, 2025 10:40
@ViktorT-11
Author

Thanks for the really great feedback @GustavoStingelin 🙏!

I've raised an internal discussion of how we want to tackle this issue, as it's unfortunately not easy to solve unless we go for approach (1). The migration version is unfortunately passed around as an int32 on 32-bit OS builds, and with the offset approach added by this PR, the number of values available above the offset needs to be at least the number of values below it. The maximum offset we can use is therefore the max value of int32 / 2, so approach (2) is not really viable either, unfortunately...

I'll update you when we've reached a conclusion of how we want to tackle this issue.

@lightninglabs-deploy

@ffranr: review reminder
@ziggie1984: review reminder
@ViktorT-11, remember to re-request review from reviewers when ready

This commit modifies the migration framework to re-attempt migration
tasks on the next run if they error during a migration. Previously, if
a migration task failed but its associated SQL migration succeeded,
the database version would be set to a dirty state, requiring manual
intervention to reset the SQL migration and re-attempt it plus the
migration task.

The new re-attempt mechanism is achieved by ensuring that a migration
can only be either an SQL migration or a migration task, but not both.
This way, if a migration task errors, the database version will be
reset to the previous version prior to executing the migration task, and
the migration task will be re-attempted on the next run.
@ViktorT-11 ViktorT-11 force-pushed the 2025-09-rerun-post-step-callbacks-on-error branch from b98dde9 to b2047a3 on November 13, 2025 01:24
@ViktorT-11
Author

Updated the approach used to achieve the re-run to the following logic:

We now define a migration such that it can be either an SQL migration or a migration task, but not both.

If a migration task errors, we reset the database version to the version it was set to before attempting to execute the migration task. The migration task will therefore be re-executed on the next startup.

That completely removes the need for the "migration task offset" approach.

I also suggest that we now rename the "Migration task"/"post migration step" to a "code migration" instead, as it better explains what that migration type actually is now that it cannot also be an SQL migration. Reviewers, do you agree with that reasoning?

Labels

enhancement New feature or request
