
@fkomauli

The primary goal of this PR is to implement a mechanism for locking dynamic shared memory (DSM) provided to worker processes during a parallel foreign scan.

The proposed solution is based on three layers:

  • a low-level Rust wrapper over PostgreSQL's LWLock, not tied to SHMEM hooks;
  • a mechanism for requesting an LWLock tranche on startup, stored as a static global, based on the existing PgSharedMemoryInitialization trait and the pg_shmem_init! macro;
  • high-level components designed to be used directly in PostgreSQL FDW routines (pgrx_*_foreign_scan).
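
To illustrate the property the lowest layer has to uphold (a held lock is released even when the code holding it unwinds), here is a minimal, std-only sketch. `ToyLock`, `ToyGuard`, and `released_after_unwind` are hypothetical names made up for this illustration; they are not the types added by this PR and do not touch PostgreSQL's LWLock API. The sketch only covers a Rust panic unwind, not a PostgreSQL `longjmp`.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for a lock that would live in shared memory.
struct ToyLock {
    held: AtomicBool,
}

/// Guard that releases the lock when dropped, including during a panic unwind.
struct ToyGuard<'a> {
    lock: &'a ToyLock,
}

impl ToyLock {
    const fn new() -> Self {
        ToyLock { held: AtomicBool::new(false) }
    }

    /// Try to take the lock without blocking; None if it is already held.
    fn try_lock(&self) -> Option<ToyGuard<'_>> {
        self.held
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .ok()
            .map(|_| ToyGuard { lock: self })
    }

    fn is_held(&self) -> bool {
        self.held.load(Ordering::Acquire)
    }
}

impl Drop for ToyGuard<'_> {
    fn drop(&mut self) {
        // Runs on normal scope exit and while a panic unwinds the stack.
        self.lock.held.store(false, Ordering::Release);
    }
}

/// Take the lock, panic while holding it, and report whether it was released.
fn released_after_unwind(lock: &ToyLock) -> bool {
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        let _guard = lock.try_lock().expect("lock should be free");
        panic!("simulated error while holding the lock");
    }));
    result.is_err() && !lock.is_held()
}

fn main() {
    let lock = ToyLock::new();
    assert!(released_after_unwind(&lock));
    // The lock can be taken again after the unwind.
    assert!(lock.try_lock().is_some());
}
```

The sketch assumes the default `panic = "unwind"` profile setting; with `panic = "abort"` no destructors run and the guard cannot help.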

Tests

This PR provides the same test coverage for the new lock mechanisms as already exists for PgLwLock. I didn't find a clean way to test BgWorkers and shared memory using the new lock types. Instead, I provided an example implementation in pgrx-examples/parallel_scan_lwlock that runs parallel scans whose workers access a shared resource (a naive counter) as the source of tuples for foreign tables.
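
As a rough picture of what the example does, the following std-only sketch has several workers drain a shared counter under a lock. It uses threads and `std::sync::Mutex` in place of PostgreSQL worker processes, DSM, and the new lock types, so `SharedCounter`, `claim_next`, and `run_parallel_scan` are hypothetical names for illustration only and do not reflect the code in pgrx-examples/parallel_scan_lwlock.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the state shared between workers:
/// a naive counter that hands out the next tuple value.
struct SharedCounter {
    next: u64,
    limit: u64,
}

/// Claim the next value under the lock, or None once the source is exhausted.
fn claim_next(shared: &Mutex<SharedCounter>) -> Option<u64> {
    let mut guard = shared.lock().unwrap();
    if guard.next >= guard.limit {
        return None;
    }
    let value = guard.next;
    guard.next += 1;
    Some(value)
}

/// Run `workers` threads that drain the counter; return every value, sorted.
fn run_parallel_scan(workers: usize, limit: u64) -> Vec<u64> {
    let shared = Arc::new(Mutex::new(SharedCounter { next: 0, limit }));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                let mut rows = Vec::new();
                while let Some(value) = claim_next(&shared) {
                    rows.push(value);
                }
                rows
            })
        })
        .collect();

    let mut all: Vec<u64> = handles
        .into_iter()
        .flat_map(|handle| handle.join().unwrap())
        .collect();
    all.sort_unstable();
    all
}

fn main() {
    // Every value is produced exactly once, regardless of which worker got it.
    let rows = run_parallel_scan(4, 100);
    assert_eq!(rows.len(), 100);
    println!("scanned {} rows", rows.len());
}
```

Because each value is claimed while the lock is held, no two workers can return the same tuple, which is the invariant the real example relies on its lock to provide.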

If you have any suggestions on how to test this, I'll extend the test suite accordingly.

@eeeebbbbrrrr
Contributor

Sorry for the delay. I didn't realize this was waiting for CI approval. Let's see what happens!

@fkomauli
Author

The run reports a "startup failure"?

@eeeebbbbrrrr
Contributor

You might need to push something to this PR in order to get the tests to run. Something didn't work like it's supposed to.

@fkomauli force-pushed the dsm-locks branch 2 times, most recently from 15ae6cd to eac70c9 on October 20, 2025 09:32
@fkomauli
Author

Fixed some cargo fmt issues. The workflow still requires approval before it can run.

@fkomauli
Author

I took some time to set up a Windows Server 2022 VM with the full required toolchain. I haven't been able to reproduce the test failure that occurred in GitHub Actions:

image

I'm still running the test repeatedly, hoping to trigger a non-deterministic failure.

cargo test --all --no-default-features --features "pg17 pg_test cshim proptest" --all-targets -- dsm_test_lock_is_released_on_unwind

Test Environment

I'm not sure how the windows-2022 image used by GHA is configured, so my local configuration may differ. Below is the system information for the VM where I'm testing the PR. It runs under QEMU/KVM on an Ubuntu 24.04 host with the 6.14.0-33-generic Linux kernel (sorry for the screenshots; I didn't manage to install the guest extensions for copy/paste):

System

image

Visual Studio Build Tools 2019 - 16.11.52

image

Rust toolchain

image

@workingjubilee left a comment
Member

This should be rebased.

@fkomauli
Author

Test failure confirmed (this time on pg13; on the previous run it failed on pg17 first). I'll investigate further. Thanks for your support.

test tests::shmem_tests::tests::pg_dsm_test_lock_is_released_on_unwind has been running for over 60 seconds
test tests::shmem_tests::tests::pg_dsm_test_lock_is_released_on_unwind ... FAILED

failures:

---- tests::shmem_tests::tests::pg_dsm_test_lock_is_released_on_unwind stdout ----

thread 'tests::shmem_tests::tests::pg_dsm_test_lock_is_released_on_unwind' panicked at pgrx-tests\src\framework.rs:169:9:


Postgres Messages:
[2025-10-30 20:56:02.915 UTC] [2160] [6903d0e2.870]: LOG:  starting PostgreSQL 13.22, compiled by Visual C++ build 1944, 64-bit
[2025-10-30 20:56:02.918 UTC] [2160] [6903d0e2.870]: LOG:  listening on IPv6 address "::1", port 32213
[2025-10-30 20:56:02.918 UTC] [2160] [6903d0e2.870]: LOG:  listening on IPv4 address "127.0.0.1", port 32213
[2025-10-30 20:56:02.920 UTC] [2160] [6903d0e2.870]: LOG:  listening on Unix socket "D:/a/pgrx/pgrx/target/test-pgdata/.s.PGSQL.32213"
[2025-10-30 20:56:02.955 UTC] [2160] [6903d0e2.870]: LOG:  database system is ready to accept connections
[2025-10-30 20:58:39.688 UTC] [2160] [6903d0e2.870]: LOG:  server process (PID 6944) was terminated by exception 0xC0000409
[2025-10-30 20:58:39.688 UTC] [2160] [6903d0e2.870]: DETAIL:  Failed process was running: SELECT "tests"."dsm_test_lock_is_released_on_unwind"();
[2025-10-30 20:58:39.688 UTC] [2160] [6903d0e2.870]: HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
[2025-10-30 20:58:39.688 UTC] [2160] [6903d0e2.870]: LOG:  terminating any other active server processes
[2025-10-30 20:58:39.694 UTC] [2160] [6903d0e2.870]: LOG:  all server processes terminated; reinitializing


Test Function Messages:
[2025-10-30 20:56:10.827 UTC] [6944] [6903d0ea.1b20]: LOG:  statement: START TRANSACTION
[2025-10-30 20:56:10.827 UTC] [6944] [6903d0ea.1b20]: LOG:  statement: SELECT "tests"."dsm_test_lock_is_released_on_unwind"();
[2025-10-30 20:58:39.322 UTC] [6944] [6903d0ea.1b20]: PANIC:  stuck spinlock detected at LWLockWaitListLock, D:\a\postgresql-packaging-foundation\postgresql-packaging-foundation\postgresql-13.22\src\backend\storage\lmgr\lwlock.c:918
[2025-10-30 20:58:39.322 UTC] [6944] [6903d0ea.1b20]: STATEMENT:  SELECT "tests"."dsm_test_lock_is_released_on_unwind"();


Client Error:
stuck spinlock detected at LWLockWaitListLock, D:\a\postgresql-packaging-foundation\postgresql-packaging-foundation\postgresql-13.22\src\backend\storage\lmgr\lwlock.c:918
postgres location: D:\a\postgresql-packaging-foundation\postgresql-packaging-foundation\postgresql-13.22\src\backend\storage\lmgr\s_lock.c:83
rust location: <unknown>


stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/1159e78c4747b02ef996e55082b704c09b970588/library\std\src\panicking.rs:697
   1: core::panicking::panic_fmt
             at /rustc/1159e78c4747b02ef996e55082b704c09b970588/library\core\src\panicking.rs:75
   2: pgrx_tests::framework::run_test
             at .\src\framework.rs:169
   3: pgrx_tests::tests::shmem_tests::tests::pg_dsm_test_lock_is_released_on_unwind
             at .\src\tests\shmem_tests.rs:172
   4: pgrx_tests::tests::shmem_tests::tests::pg_dsm_test_lock_is_released_on_unwind::closure$0
             at .\src\tests\shmem_tests.rs:172
   5: core::ops::function::FnOnce::call_once<pgrx_tests::tests::shmem_tests::tests::pg_dsm_test_lock_is_released_on_unwind::closure_env$0,tuple$<> >
             at /rustc/1159e78c4747b02ef996e55082b704c09b970588\library\core\src\ops\function.rs:253
   6: core::ops::function::FnOnce::call_once
             at /rustc/1159e78c4747b02ef996e55082b704c09b970588/library\core\src\ops\function.rs:253
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    tests::shmem_tests::tests::pg_dsm_test_lock_is_released_on_unwind

test result: FAILED. 518 passed; 1 failed; 1 ignored; 0 measured; 0 filtered out; finished in 209.07s

@workingjubilee
Member

On Windows, the "unwind" and "longjmp" mechanics are identical, which results in some irregularities. It may be best to account for this.
