Introduce CHANGE_LOG strategy #54
base: main
Conversation
Coverage Report Results
1 empty file skipped.
timb07
left a comment
Here are some initial comments. Overall, this looks like it should work! :)
I've tried to come up with scenarios of sequences of operations on the source table that span the backfill, sync schemas or post sync stages and somehow don't correctly get replicated to the copy table, but so far everything I've come up with should be handled correctly. :)
I'd like a chance to test this out locally before I give an approval.
| assert "change_log_trigger" in columns | ||
| assert "change_log" in columns |
🐼 For completeness, we should assert that "change_log_function" and "change_log_copy_function" are in the list of columns as well.
Fixup: a426df9
src/psycopack/_commands.py
Outdated
    SELECT {columns}
    FROM {schema}.{table_from}
    WHERE {pk_column} = ANY (pks)
    ON CONFLICT DO NOTHING;
Given we've just deleted any rows in the destination table with PK matching any of the PKs in pks, how could we ever have a conflict here? Would it be better to remove the ON CONFLICT, and throw an error if we encounter this unexpected situation?
I agree that it would be better to raise an error. One scenario where a conflict might occur is if the Psycopack user interfered in the process by adding a new unique/exclusion constraint on the fly (after sync schemas); re-inserting a row that has since changed in the source table could then violate that constraint, and that should raise an error.
Fixup: 84fdc77
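For reference, a minimal sketch of the resulting shape (hypothetical table and column names; `pks` stands for the batch's primary keys, as in the quoted snippet, so this would live inside a plpgsql function):

```sql
-- The matching rows were just deleted, so a conflict here is
-- unexpected; without ON CONFLICT DO NOTHING, Postgres raises a
-- unique/exclusion violation instead of silently skipping the row.
DELETE FROM example_copy WHERE id = ANY (pks);
INSERT INTO example_copy (id, value)
SELECT id, value
FROM example_source
WHERE id = ANY (pks);
```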
tests/test_repack.py
Outdated
| "pk_type", | ||
| ("bigint", "bigserial", "integer", "serial", "smallint", "smallserial"), | ||
| ) | ||
| def test_repack_with_changes_log_strategy( |
🐼 typo: "..._change_log_..."
Fixup: 6d0da34
    POST_SYNC_UPDATE = StageInfo(name="POST_SYNC_UPDATE", step=5)
    SWAP = StageInfo(name="SWAP", step=6)
    CLEAN_UP = StageInfo(name="CLEAN_UP", step=7)
I suspect adding a stage and renumbering the two later stages might cause issues for any Psycopack operations that are currently underway. This would be difficult to handle, so we should caution users to complete any Psycopack conversions before updating to this new version of Psycopack with the new CHANGE_LOG strategy.
That's right. Worth putting a note in the CHANGELOG to alert users about that.
    # But it is in the change log.
    cur.execute(f"SELECT * FROM {repack.change_log};")
    assert cur.fetchall() == [(1, 101), (3, 9999), (5, 102)]
created_row_id is 102, right? Should we assert that at some point?
Correct, it is the only insert into the table after the assertion above.
I added the id assertion in this fixup: ae892fd
src/psycopack/_repack.py
Outdated
    # The change log trigger and function have already been
    # dropped during the schema sync stage. The table is the
    # only artefact remaining.
This comment (copied from the clean_up method) doesn't apply here. The code below correctly drops the trigger and function, since reset could be called before the schema sync stage.
Good catch. Fixup: a30ee16
force-pushed from feb83ec to a426df9
This commit starts the work to allow Psycopack to perform the Sync Schema stage without CONCURRENT statements. This will allow the Sync Schema stage to run to completion without being blocked by long-running transactions. For that to happen, data synchronisation from parent to copy must be aided by a separate log table that tracks rows from the parent table that have changed (hence the CHANGE_LOG strategy). This commit introduces a new argument to the Psycopack class and the enum itself; these changes don't do anything meaningful yet on their own. A base test is also introduced, to be iterated on in future changes.
In subsequent changes there will be another type of trigger needed to service SyncStrategy.CHANGES_LOG. That will be a trigger from the source table to the changes log table. This change specialises the name to avoid confusion in terminology.
This change adds four new columns to the Registry table:

- sync_strategy
- change_log
- change_log_function
- change_log_trigger

The change also handles updating existing Registry tables for users of Psycopack who already started processing their tables, so that they also get these fields. This is part of a larger piece of work to allow Psycopack to have a new data and schema synchronisation strategy that uses a change log table.
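A minimal sketch of how that upgrade could look; the registry table name and column types below are assumptions, not Psycopack's actual schema:

```sql
-- ADD COLUMN IF NOT EXISTS lets the same statement serve both fresh
-- registries and ones created by older Psycopack versions.
ALTER TABLE psycopack_registry
    ADD COLUMN IF NOT EXISTS sync_strategy text,
    ADD COLUMN IF NOT EXISTS change_log text,
    ADD COLUMN IF NOT EXISTS change_log_function text,
    ADD COLUMN IF NOT EXISTS change_log_trigger text;
```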
- Change log table: keeps a record of writes against the table. It only stores the primary key of the source table row that changed.
- Function: used by the trigger to store the PK that changed in the change log table.
- Trigger: pulls changes from the source table into the change log using the function above.
- Copy function: used to copy data from the change log onto the copy table.
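As a rough illustration of the first three objects (all names and the plpgsql body below are hypothetical sketches, not Psycopack's actual generated DDL; the copy function is sketched under the post_sync_update commit below):

```sql
-- Change log table: stores only the PK of the source row that changed.
CREATE TABLE example_change_log (
    id bigserial PRIMARY KEY,
    pk bigint NOT NULL
);

-- Function: called by the trigger to record the changed PK.
CREATE FUNCTION example_change_log_fn() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO example_change_log (pk) VALUES (OLD.id);
    ELSE
        INSERT INTO example_change_log (pk) VALUES (NEW.id);
    END IF;
    RETURN NULL;  -- AFTER trigger: the return value is ignored.
END;
$$ LANGUAGE plpgsql;

-- Trigger: pushes every write on the source table into the change log.
CREATE TRIGGER example_change_log_trigger
AFTER INSERT OR UPDATE OR DELETE ON example_source
FOR EACH ROW EXECUTE FUNCTION example_change_log_fn();
```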
The new synchronisation strategy requires a new step after synchronising the schema, so that all the changes that happened in the original table since the Psycopack process began can be processed. This commit only adds the changes required by the tracker, leaving the implementation of this stage to a subsequent commit.
The post_sync_update stage is responsible for processing rows that have changed in the original table since the Psycopack setup stage began. These rows are pushed into the copy table using an idempotent process that:

- Locks the rows from the original and copy tables that are to be backfilled, so that no changes happen to them while they are being backfilled to the copy table.
- Locks the assigned rows from the change log table too, so that they aren't changed in the interim.
- Deletes any rows from the copy table that match the rows being processed; this is what handles inserts, updates, and deletes idempotently.
- Inserts into the copy table the exact rows from the original table.
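A minimal sketch of one such batch, assuming a bigint id primary key and the hypothetical table names from the sketch above (per the earlier commit, the real work lives in a generated copy function):

```sql
BEGIN;

-- Lock the batch's rows in the source, copy, and change log tables so
-- that they cannot change while the batch is replayed.
SELECT id FROM example_source WHERE id = ANY ('{1,3,5}'::bigint[]) FOR UPDATE;
SELECT id FROM example_copy WHERE id = ANY ('{1,3,5}'::bigint[]) FOR UPDATE;
SELECT pk FROM example_change_log WHERE pk = ANY ('{1,3,5}'::bigint[]) FOR UPDATE;

-- Delete-then-insert makes the batch idempotent and covers inserts,
-- updates, and deletes alike: a row deleted from the source is simply
-- never re-inserted into the copy.
DELETE FROM example_copy WHERE id = ANY ('{1,3,5}'::bigint[]);
INSERT INTO example_copy
SELECT * FROM example_source WHERE id = ANY ('{1,3,5}'::bigint[]);

COMMIT;
```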
If the strategy is CHANGE_LOG, we can put back the trigger to copy data from the source table into the copy table and remove the change_log table trigger. This will ensure the change_log does not grow forever, which would be tricky to deal with in code. This also makes the swap and revert_swap process trivial, as there won't be any differences between the CHANGE_LOG and the DIRECT_TRIGGER strategies.
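Sketched with hypothetical names (example_copy_fn standing in for whatever function the direct trigger uses):

```sql
-- Stop feeding the change log; from now on writes go straight to the
-- copy table, so the change log cannot grow indefinitely.
DROP TRIGGER example_change_log_trigger ON example_source;

CREATE TRIGGER example_copy_trigger
AFTER INSERT OR UPDATE OR DELETE ON example_source
FOR EACH ROW EXECUTE FUNCTION example_copy_fn();
```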
When using the CHANGE_LOG strategy, the copy table has no hard dependency on the source table and therefore doesn't need to create indexes CONCURRENTLY.
force-pushed from a30ee16 to 41190a3
Background
The way Psycopack works today is by having a trigger to synchronise data
between the source table and the copy table.
One caveat here is that the trigger itself establishes a hard link between the
source table and the copy table.
Effectively this means that applying any hard locks on the copy table
would practically lock the source table as well.
Due to this constraint, the sync schemas stage in Psycopack relies on DDLs that
don't lock the copy table, to avoid blocking the source table as well.
This ends up having an effect on how Psycopack creates indexes on the copy
table, as such indexes have to be created using CONCURRENTLY to avoid taking an
access exclusive lock on the copy table (which would consequently block the
trigger on the source table from operating).
Due to the way concurrent index creation works in Postgres, an index is not
valid until all transactions that were alive before the index creation DDL
fired have finished.
This means that environments using Psycopack that happen to have long-running
transactions would effectively delay the sync schema stage until those old
transactions are finished.
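To illustrate the caveat, a concurrently built index is visible but invalid while the build waits on pre-existing transactions (names hypothetical):

```sql
-- Session 1: a long-running transaction that holds a snapshot.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM unrelated_table;
-- ...transaction left open...

-- Session 2: waits near the end of the build for session 1 to finish.
CREATE INDEX CONCURRENTLY example_idx ON example_copy (value);

-- Session 3: meanwhile the index exists but is not yet valid.
SELECT indisvalid FROM pg_index
WHERE indexrelid = 'example_idx'::regclass;
```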
Solution (this change)
This pull request introduces a new back-end strategy for synchronising data
between the source table and the copy table that does not involve a direct link
between these two tables.
This new concept is defined as a "sync strategy" (for data synchronisation).
The new sync strategy is called CHANGE LOG, as it involves the creation of an
intermediary table where changes from the source table are kept before the
two tables are fully synchronised in terms of data. This intermediary table is
called the "change log".
Having a change log means that the sync schema stage can operate without
CONCURRENT index creation, such that this stage doesn't get blocked from
advancing due to long-running transactions.
In order to guarantee perfect data synchronisation between the source and copy
tables, the direct trigger from src to copy is added back, but only after the
schema sync has succeeded.
A high-level explanation of the new functionality at each stage of Psycopack
when using the CHANGE LOG sync strategy is as follows:

- Setup: instead of creating a direct trigger from the source table to the
  copy table, the trigger goes into the change log table. Any rows
  updated/created/deleted on the source table will be reflected in the change
  log table effectively from then on.
- Sync schemas: indexes are created in place, without CONCURRENTLY. Once the
  schema is updated, drop the change log trigger and create a direct trigger
  from src to copy. This guarantees that the change log doesn't grow forever,
  and that any new writes go straight to the copy table instead.
- Post sync update: performs updates on the copy table based on rows that
  changed between the setup and sync_schema stages.