
Conversation

@marcelofern
Collaborator

Background

The way Psycopack works today is by having a trigger to synchronise data
between the source table and the copy table.

One caveat here is that the trigger itself establishes a hard link between the
source table and the copy table.

Effectively, this means that taking any heavy locks on the copy table
would in practice lock the source table as well.

Due to this constraint, the sync schema stage in Psycopack relies on DDL that
doesn't take heavy locks on the copy table, to avoid blocking the source table too.

This ends up having an effect on how Psycopack creates indexes on the copy
table, as such indexes have to be created using CONCURRENTLY to avoid taking a
lock on the copy table that blocks writes (which would consequently block the
trigger on the source table from operating).
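
As a rough illustration of the trade-off (the table and index names here are hypothetical, not Psycopack's):

```sql
-- Plain CREATE INDEX takes a lock that blocks writes on the copy table
-- for the duration of the build; through the sync trigger, writes on
-- the source table would stall too.
CREATE INDEX idx_copy_email ON copy_table (email);

-- CREATE INDEX CONCURRENTLY allows writes during the build, but the
-- index only becomes valid once every transaction that started before
-- the build has finished.
CREATE INDEX CONCURRENTLY idx_copy_email ON copy_table (email);
```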

Due to the way concurrent index creation works in Postgres, an index is not
valid until all transactions that were alive before the index creation DDL
fired have finished.

This means that environments using Psycopack that happen to have long-running
transactions would effectively see the sync schema stage delayed until those
old transactions finish.

Solution (this change)

This pull request introduces a new back-end strategy for synchronising data
between the source table and the copy table that does not involve a direct link
between these two tables.

This new concept is defined as a "sync strategy" (for data synchronisation).

The new sync strategy is called CHANGE LOG, as it involves creating an
intermediary table where changes from the source table are recorded until the
two tables are fully synchronised in terms of data. This intermediary table is
called the "change log".
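
A minimal sketch of what such a change log could look like (all identifiers here are illustrative, not Psycopack's actual ones):

```sql
-- The change log stores only the primary key of the source row that
-- changed, not the row's data.
CREATE TABLE change_log (
    id bigserial PRIMARY KEY,
    pk bigint NOT NULL  -- PK of the source row that changed
);

CREATE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO change_log (pk) VALUES (OLD.id);
    ELSE
        INSERT INTO change_log (pk) VALUES (NEW.id);
    END IF;
    RETURN NULL;  -- AFTER triggers ignore the return value
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER track_changes
    AFTER INSERT OR UPDATE OR DELETE ON source_table
    FOR EACH ROW EXECUTE FUNCTION log_change();
```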

Having a change log means that the sync schema stage can operate without
CONCURRENT index creation, so this stage isn't blocked from advancing by
long-running transactions.

In order to guarantee perfect data synchronisation between the source and copy
tables, the direct trigger from source to copy is added back, but only after
the schema sync has succeeded.

A high-level explanation of the new behaviour at each stage of Psycopack when
using the CHANGE LOG sync strategy is as follows:

  1. setup: the change log table is created here, and instead of a trigger
    from the source table to the copy table, the trigger writes to the change
    log table. Any rows inserted/updated/deleted on the source table are
    reflected in the change log table from this point on.
  2. backfill: no changes here; backfill still works in batches.
  3. sync_schema: instead of creating indexes CONCURRENTLY, simply create
    indexes in place. Once the schema is updated, drop the change log trigger
    and create a direct trigger from source to copy. This guarantees that the
    change log doesn't grow forever, and that any new writes go straight to
    the copy table instead.
  4. post_sync_update [new stage]: this stage reads the change log and
    performs updates on the copy table based on rows that changed between the
    setup and sync_schema stages.
  5. swap: nothing changes.
  6. clean_up: the change_log table is dropped here.
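
The post sync update stage (step 4) could be sketched as a delete-then-insert batch like the following. The identifiers are hypothetical, and the real implementation also locks the affected rows while a batch is processed:

```sql
BEGIN;

-- Collect one batch of changed primary keys from the change log.
CREATE TEMP TABLE batch ON COMMIT DROP AS
    SELECT DISTINCT pk FROM change_log LIMIT 1000;

-- Delete-then-insert makes the batch idempotent: an updated row is
-- rewritten, and a row deleted from the source is simply not
-- re-inserted.
DELETE FROM copy_table WHERE id IN (SELECT pk FROM batch);
INSERT INTO copy_table
    SELECT * FROM source_table WHERE id IN (SELECT pk FROM batch);

COMMIT;
```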

@marcelofern marcelofern requested a review from a team as a code owner January 5, 2026 23:20
@github-actions

github-actions bot commented Jan 5, 2026

Coverage Report Results

| Name | Stmts | Miss | Branch | BrPart | Cover |
| --- | --- | --- | --- | --- | --- |
| `src/psycopack/__init__.py` | 8 | 0 | 0 | 0 | 100% |
| `src/psycopack/_commands.py` | 130 | 0 | 6 | 0 | 100% |
| `src/psycopack/_conn.py` | 5 | 0 | 0 | 0 | 100% |
| `src/psycopack/_const.py` | 3 | 0 | 0 | 0 | 100% |
| `src/psycopack/_cur.py` | 24 | 0 | 2 | 0 | 100% |
| `src/psycopack/_identifiers.py` | 12 | 0 | 2 | 0 | 100% |
| `src/psycopack/_introspect.py` | 183 | 0 | 18 | 0 | 100% |
| `src/psycopack/_logging.py` | 2 | 0 | 0 | 0 | 100% |
| `src/psycopack/_psycopg.py` | 5 | 0 | 0 | 0 | 100% |
| `src/psycopack/_registry.py` | 88 | 0 | 12 | 0 | 100% |
| `src/psycopack/_repack.py` | 445 | 3 | 148 | 3 | 99% |
| `src/psycopack/_sync_strategy.py` | 4 | 0 | 0 | 0 | 100% |
| `src/psycopack/_tracker.py` | 182 | 2 | 46 | 2 | 98% |
| `tests/conftest.py` | 19 | 0 | 0 | 0 | 100% |
| `tests/factories.py` | 41 | 0 | 10 | 0 | 100% |
| `tests/test_cur.py` | 20 | 0 | 2 | 0 | 100% |
| `tests/test_fixtures.py` | 5 | 0 | 0 | 0 | 100% |
| `tests/test_package.py` | 3 | 0 | 0 | 0 | 100% |
| `tests/test_repack.py` | 907 | 0 | 16 | 0 | 100% |
| **TOTAL** | 2086 | 5 | 262 | 5 | 99% |

1 empty file skipped.

Contributor

@timb07 timb07 left a comment


Here are some initial comments. Overall, this looks like it should work! :)

I've tried to come up with scenarios of sequences of operations on the source table that span the backfill, sync schemas or post sync stages and somehow don't correctly get replicated to the copy table, but so far everything I've come up with should be handled correctly. :)

I'd like a chance to test this out locally before I give an approval.

Comment on lines +880 to +913
assert "change_log_trigger" in columns
assert "change_log" in columns
Contributor


🐼 For completeness, we should assert that "change_log_function" and "change_log_copy_function" are in the list of columns as well.

Collaborator Author


Fixup: a426df9

SELECT {columns}
FROM {schema}.{table_from}
WHERE {pk_column} = ANY (pks)
ON CONFLICT DO NOTHING;
Contributor


Given we've just deleted any rows in the destination table with PK matching any of the PKs in pks, how could we ever have a conflict here? Would it be better to remove the ON CONFLICT, and throw an error if we encounter this unexpected situation?

Collaborator Author


I agree that it would be better to raise an error. One scenario where this might occur: the Psycopack user interferes with the process by adding a new unique/exclusion constraint on the fly (after sync schemas); re-inserting a row that has since changed in the source table could then violate that constraint and raise an error.

Fixup: 84fdc77

"pk_type",
("bigint", "bigserial", "integer", "serial", "smallint", "smallserial"),
)
def test_repack_with_changes_log_strategy(
Contributor


🐼 typo: "..._change_log_..."

Collaborator Author


Fixup: 6d0da34

Comment on lines +62 to +64
POST_SYNC_UPDATE = StageInfo(name="POST_SYNC_UPDATE", step=5)
SWAP = StageInfo(name="SWAP", step=6)
CLEAN_UP = StageInfo(name="CLEAN_UP", step=7)
Contributor


I suspect adding a stage and renumbering the two later stages might cause issues for any Psycopack operations that are currently underway. This would be difficult to handle, so we should caution users to complete any Psycopack conversions before updating to this new version of Psycopack with the new CHANGE_LOG strategy.

Collaborator Author


That's right. Worth putting a note in the CHANGELOG to alert users about that.


# But it is in the change log.
cur.execute(f"SELECT * FROM {repack.change_log};")
assert cur.fetchall() == [(1, 101), (3, 9999), (5, 102)]
Contributor


created_row_id is 102, right? Should we assert that at some point?

Collaborator Author


Correct, it is the only insert in the table after the assertion above.
I added the id assertion in this fixup: ae892fd

Comment on lines 671 to 673
# The change log trigger and function have already been
# dropped during the schema sync stage. The table is the
# only artefact remaining.
Contributor


This comment (copied from the clean_up method) doesn't apply here. The code below correctly drops the trigger and function, since reset could be called before the schema sync stage.

Collaborator Author


Good catch. Fixup: a30ee16

@timb07 timb07 requested a review from a team January 9, 2026 00:44
@marcelofern marcelofern force-pushed the pg-repack-backfill-and-swap-strategy branch from feb83ec to a426df9 Compare January 9, 2026 03:08
This commit starts the work to allow Psycopack to perform the Sync
Schema stage without CONCURRENT statements.

This will allow the Sync Schema stage to run to completion without being
blocked by long-running transactions.

For that to happen, data synchronisation from parent to copy must be
aided by a separate log table that tracks rows from the parent table
that have changed (hence the CHANGE_LOG strategy).

This commit introduces a new argument to the Psycopack class and the
enum itself; these changes on their own don't do anything meaningful yet.
A base test is also introduced to be iterated over in future changes.

In subsequent changes there will be another type of trigger needed to
service SyncStrategy.CHANGES_LOG. That will be a trigger from the source
table to the change log table. This change specialises the name to avoid
confusion in terminology.
This change adds four new columns to the Registry table:

- sync_strategy
- change_log
- change_log_function
- change_log_trigger

The change also handles updating existing Registry tables for users of
Psycopack who already started processing their tables so that they can
also get the update for these fields.

This is part of a larger work to allow Psycopack to have a new data and
schema synchronisation strategy that uses a change log table.
- Change log table: Keeps a record of writes against the table. It only
  stores the primary key of the source table row that changed.
- Function: Used by the trigger to store the PK that changed in the
  change log table.
- Trigger: pulls changes from the source table into the change log using
  the function above.
- Copy function: Used to copy data from the change log onto the copy
  table.

The new synchronisation strategy requires a new step after synchronising
the schema so that all the changes that happened in the original table
since the Psycopack process began can be processed through.

This commit only adds the changes required by the tracker, leaving the
implementation of this stage to a subsequent commit.

The post_sync_update stage is responsible for processing rows that have
changed in the original table since the Psycopack setup stage began.

These rows are pushed into the copy table using an idempotent process
that:

- Locks rows from the original and copy tables to be backfilled so that
  no changes happen to them while they are being backfilled to the copy
  table.
- Locks the assigned rows from the change log table too, so that they
  aren't changed in the interim.
- Deletes any rows from the copy table that match the rows being
  processed; this handles inserts, updates, and deletes idempotently.
- Inserts into the copy table the exact row from the original table.

If the strategy is CHANGE_LOG, we can put back the trigger to copy data
from the source table into the copy table and remove the change_log
table trigger.

This will ensure the change_log does not grow forever, which would be
tricky to deal with in code. This also makes the swap and revert_swap
process trivial, as there won't be any differences between the
CHANGE_LOG and the DIRECT_TRIGGER strategies.
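
The trigger swap described above could look roughly like this (names are illustrative); doing both steps in one transaction means no write can slip between the two triggers:

```sql
BEGIN;
-- Stop recording changes into the change log...
DROP TRIGGER track_changes ON source_table;
-- ...and start mirroring writes directly into the copy table.
CREATE TRIGGER mirror_writes
    AFTER INSERT OR UPDATE OR DELETE ON source_table
    FOR EACH ROW EXECUTE FUNCTION copy_write();
COMMIT;
```
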
When using the CHANGE_LOG strategy, the copy table has no hard
dependency on the source table and therefore doesn't need to create
indexes CONCURRENTLY.
@marcelofern marcelofern force-pushed the pg-repack-backfill-and-swap-strategy branch from a30ee16 to 41190a3 Compare January 9, 2026 03:32