Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add draft ADR for bigint id column migration #4191

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

johha
Copy link
Contributor

@johha johha commented Jan 28, 2025

To avoid overflows of the id primary key column we should migrate it to bigint.
The purpose of this ADR is to find a good approach and to document all relevant decisions.

Feedback is highly appreciated 🙂

@johha johha force-pushed the bigint-migration-adr branch 3 times, most recently from 8790c52 to 41c4af3 Compare February 4, 2025 13:37
Copy link
Contributor

@jochenehret jochenehret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any considerations for the Ruby sequel library necessary?

Add draft ADR for bigint id column migration

Co-authored-by: Johannes Haass <[email protected]>
@johha johha force-pushed the bigint-migration-adr branch from 41c4af3 to c874d65 Compare February 6, 2025 07:40
@Samze
Copy link
Contributor

Samze commented Feb 12, 2025

Just as a heads up, even though MySQL 5.7 is EOL. Since its still available from certain cloud providers, we technically still support it until 2025-07-31.

So if we're thinking of doing this before that date, we'd need to test against that MySQL version.

@johha
Copy link
Contributor Author

johha commented Feb 14, 2025

Just as a heads up, even though MySQL 5.7 is EOL. Since its still available from certain cloud providers, we technically still support it until 2025-07-31.

So if we're thinking of doing this before that date, we'd need to test against that MySQL version.

👍 According to the mysql docs there should no downtime when renaming columns (source). Still someone who has access to a copy of a productive ccdb should try out the migration procedure.

stephanme
stephanme previously approved these changes Feb 24, 2025
@johha johha marked this pull request as ready for review February 24, 2025 15:57
- Backfill will be executed outside the migration due to its potentially long runtime.
- If necessary the backfill will run for multiple weeks to ensure all records are processed.

#### Step 3a - Migration Pre Check
Copy link
Contributor

@Samze Samze Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Are 3a/3b steps in different cf-deployment releases? And does this mean that people have to upgrade in sequence for this migration to be successful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step 3a is just a check to ensure we can start the migration.
The plan is to ship 3a & 3b with the same CAPI/CF-D release. We will have one major release for steps 1 & 2 which prepares the migration and another major for steps 3a & 3b for the actual migration.
Between those two releases there should be reasonable time - maybe something like 60 days.

Copy link
Contributor

@Samze Samze Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the behaviour if someone goes from a cfd version before this to a cfd version that contains all the changes?

Will the migration still work just with a long deploy time?

One issue for us is that we do not enforce sequential updates of patch releases for customers. So it'll be difficult for us to do this phased approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CF-D has a rule that major versions cannot be skipped (Docs):

No major version must be skipped when updating cf-deployment. If an operator skips a major version, a seamless update is no longer guaranteed. This additional constraint is necessary for features which must be rolled out step-by-step, e.g. complex database migrations.

- Add a `CHECK` constraint to verify that id_bigint is fully populated (`id_bigint == id & id_bigint != NULL`).
- In case the backfill is not yet complete or the `id_bigint` column is not fully populated the migration exits gracefully and is retried in the next deploy.
#### Step 3b - Actual Migration
- Remove the `CHECK` constraint once verified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Do we need to know that the check constraint was successfully added in 3a before we can start the migration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the check constraint in 3a would fail if there are still missing entries.


#### Step 3a - Migration Pre Check
- Add a `CHECK` constraint to verify that id_bigint is fully populated (`id_bigint == id & id_bigint != NULL`).
- In case the backfill is not yet complete or the `id_bigint` column is not fully populated the migration exits gracefully and is retried in the next deploy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Could you clarify migration exits gracefully? Does the deploy still succeed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be actually better if the migration fails and then the next deploy would retry it.
We'll check this again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for letting the migration fail if backfill is not yet complete.

This will be the case when an update over multiple major cf-d versions gets applied so that step 1 and step 3 run directly one after each other (i.e. step 2 = backfill doesn't have a chance to run).
We need a procedure how to resolve such situations. E.g. running the backfill manually as a rake task.

Unknown at this point. Further investigation is needed.

### Rollback Mechanism
The old `id` column is no longer retained, as the `CHECK` constraint ensures correctness during migration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not rename the old id column and drop later if required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We thought about that but if we keep the column by renaming it to e.g. id_old it will be nil for newly inserted rows as it is no longer the primary key.
By adding some function and trigger similar to step 1 id_old could be still filled but this adds additional complexity and we would need another migration step to clean up id_old.
As the migration step 3b is executed in a single transaction it will be rolled backed automatically if something goes wrong. Thus we don't need the id_old column

@abg
Copy link
Member

abg commented Feb 25, 2025

In MySQL these sort PK constraint changes (dropping, adding or dropping+adding concurrently) suggested in this proposal will result in table rebuild(s):

https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html#online-ddl-primary-key-operations

A MySQL primary key is "clustered" and referenced in every secondary index as a sort of "row pointer" so fiddling with primary keys like this is a very expensive operation in MySQL.

In some MySQL environments, this may not block writes if applied appropriately with the right database configuration. Environments using clustering solutions like Galera (e.g. pxc-release in cf-deployment), DDL is a synchronization point for replication and is going to stall writes to the database until the operation is complete.

Typically the MySQL community's approach to this class of problem is creating a new empty "ghost" / "shadow" table with the desired structure and doing a copy / "backfill" operation (typically similar to the process proposed here), and swapping tables at the end.

Some examples of "state-of-the-art" from the MySQL community:

Foreign keys add a lot of difficulty to this process - although that's not necessarily specific to MySQL.

@sethboyles
Copy link
Member

sethboyles commented Feb 27, 2025

@johha what do you think about making this an opt-in process? Either the whole process, or at least making Phase 3 able to be deferred at Operator discretion.

We're still in the process of validating this, but we don't think this is a concern for us in the short to medium term. The fact that this is a multi phase process will make it hard to coordinate (as we don't use CFD). We'd prefer to choose when we have our foundations run these migrations.

I'm not sure how easy it would be to set up a second migrations directory or schema_migrations table, and what issues might arise from having it uncertain if id columns are int or bigint (I think that the ORM will take care of that for us, though).

The tables mentioned (events, delayed_jobs, jobs, and app_usage_events) are probably ones that could easily have their increment markers reset, so if it became an issue for us I think that's the first thing solution we would look toward.

@johha
Copy link
Contributor Author

johha commented Mar 3, 2025

@sethboyles I understand that smaller foundations will probably never run into this problem and thus would like have to option to skip the bigint migration.
The easiest way to implement this would be to add a config flag and if this flag is set all related migrations will result in a no-op (they would still appear in the schema_versions table). Operators choosing this option would be required to apply those migrations manually when the id gets close to a overflow.
Therefore I'd prefer a opt-out process so that operators need to take a conscious decision.
Would this be a valid option for you?

We're also discussing currently if we should change the default type for id columns to bigint when creating new schemas. With this at least newly created foundations would never run into a overflow situation. Any thoughts on that?

@sethboyles
Copy link
Member

I think that's a potential solution. There's still a chance that instead of never running into this problem it won't be a problem for 5-10 years, and then we will need to think if we want some different solution. We're still collecting numbers to validate this.

For new foundations to have bigint as the default instead of int, that seems reasonable to me.

@johha
Copy link
Contributor Author

johha commented Mar 6, 2025

I added a description of the opt out procedure and some mysql specifics to the ADR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants