Introduce CHANGE_LOG strategy #54
base: main
Conversation
Coverage Report Results
1 empty file skipped.
timb07
left a comment
Here are some initial comments. Overall, this looks like it should work! :)
I've tried to come up with scenarios of sequences of operations on the source table that span the backfill, sync schemas or post sync stages and somehow don't correctly get replicated to the copy table, but so far everything I've come up with should be handled correctly. :)
I'd like a chance to test this out locally before I give an approval.
| assert "change_log_trigger" in columns | ||
| assert "change_log" in columns |
🐼 For completeness, we should assert that "change_log_function" and "change_log_copy_function" are in the list of columns as well.
Fixup: a426df9
src/psycopack/_commands.py
Outdated
    SELECT {columns}
    FROM {schema}.{table_from}
    WHERE {pk_column} = ANY (pks)
    ON CONFLICT DO NOTHING;
Given we've just deleted any rows in the destination table with PK matching any of the PKs in pks, how could we ever have a conflict here? Would it be better to remove the ON CONFLICT, and throw an error if we encounter this unexpected situation?
I agree that it would be better to raise an error. One scenario where a conflict might occur is if the Psycopack user interfered in the process by adding a new unique/exclusion constraint on the fly (after sync schemas); re-inserting a row that has since changed in the source table could then violate that constraint, and that should raise an error.
Fixup: 84fdc77
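For reference, a minimal sketch of the resulting shape (hypothetical table and column names; `pks` stands for the batch's primary keys, as in the quoted snippet, so this would live inside a plpgsql function):

```sql
-- The matching rows were just deleted, so a conflict here is
-- unexpected; without ON CONFLICT DO NOTHING, Postgres raises a
-- unique/exclusion violation instead of silently skipping the row.
DELETE FROM example_copy WHERE id = ANY (pks);
INSERT INTO example_copy (id, value)
SELECT id, value
FROM example_source
WHERE id = ANY (pks);
```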
tests/test_repack.py
Outdated
| "pk_type", | ||
| ("bigint", "bigserial", "integer", "serial", "smallint", "smallserial"), | ||
| ) | ||
| def test_repack_with_changes_log_strategy( |
🐼 typo: "..._change_log_..."
Fixup: 6d0da34
    POST_SYNC_UPDATE = StageInfo(name="POST_SYNC_UPDATE", step=5)
    SWAP = StageInfo(name="SWAP", step=6)
    CLEAN_UP = StageInfo(name="CLEAN_UP", step=7)
I suspect adding a stage and renumbering the two later stages might cause issues for any Psycopack operations that are currently underway. This would be difficult to handle, so we should caution users to complete any Psycopack conversions before updating to this new version of Psycopack with the new CHANGE_LOG strategy.
That's right. Worth putting a note in the CHANGELOG to alert users about that.
    # But it is in the change log.
    cur.execute(f"SELECT * FROM {repack.change_log};")
    assert cur.fetchall() == [(1, 101), (3, 9999), (5, 102)]
created_row_id is 102, right? Should we assert that at some point?
Correct, it is the only insert into the table after the assertion above.
I added the id assertion in this fixup: ae892fd
src/psycopack/_repack.py
Outdated
    # The change log trigger and function have already been
    # dropped during the schema sync stage. The table is the
    # only artefact remaining.
This comment (copied from the clean_up method) doesn't apply here. The code below correctly drops the trigger and function, since reset could be called before the schema sync stage.
Good catch. Fixup: a30ee16
force-pushed from feb83ec to a426df9
This commit starts the work to allow Psycopack to perform the Sync Schema stage without CONCURRENT statements. This will allow the Sync Schema stage to run to completion without being blocked by long-running transactions. For that to happen, data synchronisation from parent to copy must be aided by a separate log table that tracks rows from the parent table that have changed (hence the CHANGE_LOG strategy). This commit introduces a new argument to the Psycopack class and the enum itself; these changes don't do anything meaningful yet on their own. A base test is also introduced, to be iterated on in future changes.
In subsequent changes there will be another type of trigger needed to service SyncStrategy.CHANGES_LOG. That will be a trigger from the source table to the changes log table. This change specialises the name to avoid confusion in terminology.
This change adds four new columns to the Registry table:

- sync_strategy
- change_log
- change_log_function
- change_log_trigger

The change also handles updating existing Registry tables for users of Psycopack who already started processing their tables, so that they also get these fields. This is part of a larger piece of work to allow Psycopack to have a new data and schema synchronisation strategy that uses a change log table.
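A minimal sketch of how that upgrade could look; the registry table name and column types below are assumptions, not Psycopack's actual schema:

```sql
-- ADD COLUMN IF NOT EXISTS lets the same statement serve both fresh
-- registries and ones created by older Psycopack versions.
ALTER TABLE psycopack_registry
    ADD COLUMN IF NOT EXISTS sync_strategy text,
    ADD COLUMN IF NOT EXISTS change_log text,
    ADD COLUMN IF NOT EXISTS change_log_function text,
    ADD COLUMN IF NOT EXISTS change_log_trigger text;
```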
- Change log table: keeps a record of writes against the table. It only stores the primary key of the source table row that changed.
- Function: used by the trigger to store the PK that changed in the change log table.
- Trigger: pulls changes from the source table into the change log using the function above.
- Copy function: used to copy data from the change log onto the copy table.
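As a rough illustration of the first three objects (all names and the plpgsql body below are hypothetical sketches, not Psycopack's actual generated DDL; the copy function is sketched under the post_sync_update commit below):

```sql
-- Change log table: stores only the PK of the source row that changed.
CREATE TABLE example_change_log (
    id bigserial PRIMARY KEY,
    pk bigint NOT NULL
);

-- Function: called by the trigger to record the changed PK.
CREATE FUNCTION example_change_log_fn() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO example_change_log (pk) VALUES (OLD.id);
    ELSE
        INSERT INTO example_change_log (pk) VALUES (NEW.id);
    END IF;
    RETURN NULL;  -- AFTER trigger: the return value is ignored.
END;
$$ LANGUAGE plpgsql;

-- Trigger: pushes every write on the source table into the change log.
CREATE TRIGGER example_change_log_trigger
AFTER INSERT OR UPDATE OR DELETE ON example_source
FOR EACH ROW EXECUTE FUNCTION example_change_log_fn();
```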
The new synchronisation strategy requires a new step after synchronising the schema, so that all the changes that happened in the original table since the Psycopack process began can be processed. This commit only adds the changes required by the tracker, leaving the implementation of this stage to a subsequent commit.
The post_sync_update stage is responsible for processing rows that have changed in the original table since the Psycopack setup stage began. These rows are pushed into the copy table using an idempotent process that:

- Locks the rows from the original and copy tables that are to be backfilled, so that no changes happen to them while they are being backfilled to the copy table.
- Locks the assigned rows from the change log table too, so that they aren't changed in the interim.
- Deletes any rows from the copy table that match the rows being processed; this is what handles inserts, updates, and deletes idempotently.
- Inserts into the copy table the exact rows from the original table.
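A minimal sketch of one such batch, assuming a bigint id primary key and the hypothetical table names from the sketch above (per the earlier commit, the real work lives in a generated copy function):

```sql
BEGIN;

-- Lock the batch's rows in the source, copy, and change log tables so
-- that they cannot change while the batch is replayed.
SELECT id FROM example_source WHERE id = ANY ('{1,3,5}'::bigint[]) FOR UPDATE;
SELECT id FROM example_copy WHERE id = ANY ('{1,3,5}'::bigint[]) FOR UPDATE;
SELECT pk FROM example_change_log WHERE pk = ANY ('{1,3,5}'::bigint[]) FOR UPDATE;

-- Delete-then-insert makes the batch idempotent and covers inserts,
-- updates, and deletes alike: a row deleted from the source is simply
-- never re-inserted into the copy.
DELETE FROM example_copy WHERE id = ANY ('{1,3,5}'::bigint[]);
INSERT INTO example_copy
SELECT * FROM example_source WHERE id = ANY ('{1,3,5}'::bigint[]);

COMMIT;
```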
If the strategy is CHANGE_LOG, we can put back the trigger to copy data from the source table into the copy table and remove the change_log table trigger. This will ensure the change_log does not grow forever, which would be tricky to deal with in code. This also makes the swap and revert_swap process trivial, as there won't be any differences between the CHANGE_LOG and the DIRECT_TRIGGER strategies.
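Sketched with hypothetical names (example_copy_fn standing in for whatever function the direct trigger uses):

```sql
-- Stop feeding the change log; from now on writes go straight to the
-- copy table, so the change log cannot grow indefinitely.
DROP TRIGGER example_change_log_trigger ON example_source;

CREATE TRIGGER example_copy_trigger
AFTER INSERT OR UPDATE OR DELETE ON example_source
FOR EACH ROW EXECUTE FUNCTION example_copy_fn();
```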
When using the CHANGE_LOG strategy, the copy table has no hard dependency on the source table and therefore doesn't need to create indexes CONCURRENTLY.
force-pushed from a30ee16 to 41190a3
Background
The way Psycopack works today is by having a trigger to synchronise data
between the source table and the copy table.
One caveat here is that the trigger itself establishes a hard link between the
source table and the copy table.
Effectively this means that applying any hard locks on the copy table
would practically lock the source table as well.
Due to this constraint, the sync schemas stage in Psycopack relies on DDLs that
don't lock the copy table, to avoid blocking the source table as well.
This ends up having an effect on how Psycopack creates indexes on the copy
table, as such indexes have to be created using CONCURRENTLY to avoid taking an
access exclusive lock on the copy table (which would consequently block the
trigger on the source table from operating).
Due to the way concurrent index creation works in Postgres, an index is not
valid until all transactions that were alive before the index creation DDL
fired have finished.
This means that environments using Psycopack that happen to have long-running
transactions would effectively delay the sync schema stage until those old
transactions are finished.
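To illustrate the caveat, a concurrently built index is visible but invalid while the build waits on pre-existing transactions (names hypothetical):

```sql
-- Session 1: a long-running transaction that holds a snapshot.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM unrelated_table;
-- ...transaction left open...

-- Session 2: waits near the end of the build for session 1 to finish.
CREATE INDEX CONCURRENTLY example_idx ON example_copy (value);

-- Session 3: meanwhile the index exists but is not yet valid.
SELECT indisvalid FROM pg_index
WHERE indexrelid = 'example_idx'::regclass;
```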
Solution (this change)
This pull request introduces a new back-end strategy for synchronising data
between the source table and the copy table that does not involve a direct link
between these two tables.
This new concept is defined as a "sync strategy" (for data synchronisation).
The new sync strategy is called CHANGE LOG, as it involves the creation of an
intermediary table where changes from the source table are kept before the
two tables are fully synchronised in terms of data. This intermediary table is
called the "change log".
Having a change log means that the sync schema stage can operate without
CONCURRENT index creation, such that this stage doesn't get blocked from
advancing due to long-running transactions.
In order to guarantee perfect data synchronisation between the source and copy
tables, the direct trigger from src to copy is added back, but only after the
schema sync has succeeded.
A high-level explanation of the new functionality at each stage of Psycopack
when using the CHANGE LOG sync strategy is as follows:

- Setup: instead of creating a direct trigger from the source table to the
  copy table, the trigger goes into the change log table. Any rows
  updated/created/deleted on the source table will be reflected in the change
  log table effectively from then on.
- Sync schemas: indexes are created in place, without CONCURRENTLY. Once the
  schema is updated, drop the change log trigger and create a direct trigger
  from src to copy. This guarantees that the change log doesn't grow forever,
  and that any new writes go straight to the copy table instead.
- Post sync update: performs updates on the copy table based on rows that
  changed between the setup and sync_schema stages.