Skip to content

fix(gateway): drive fault transport with its own executor#402

Open
bburda wants to merge 2 commits into
mainfrom
fix/issue-399-fault-transport-spin-until-complete
Open

fix(gateway): drive fault transport with its own executor#402
bburda wants to merge 2 commits into
mainfrom
fix/issue-399-fault-transport-spin-until-complete

Conversation

@bburda
Copy link
Copy Markdown
Collaborator

@bburda bburda commented May 15, 2026

Summary

Refactors Ros2FaultServiceTransport to drive its rclcpp service clients with
a private SingleThreadedExecutor and spin_until_future_complete() instead
of blocking on a std::future::wait_for() that requires some external thread
to spin the host node.

The seven service clients are now created against a dedicated
MutuallyExclusive callback group that is explicitly not auto-attached to
the host node's executor. The transport owns the executor that drives this
group, so each RPC's client callback runs synchronously on the caller's
thread inside spin_until_future_complete. The pending-request cleanup and
the response destruction therefore happen sequentially on the same thread,
which eliminates the TSan-reported race between the gateway's spin thread
calling rclcpp::Executor::execute_client and the caller destroying the
local response shared_ptr at the end of the function.

The seven per-method mutexes collapse into one executor_mutex_ -
SingleThreadedExecutor::spin_until_future_complete() is not safe for
concurrent callers on the same executor, so all RPCs serialise through it.
This is the explicit tradeoff the issue accepts in exchange for losing the
external-spin-thread dependency.

Issue

Type

  • Bug fix
  • New feature or tests
  • Breaking change
  • Documentation only

Testing

  • colcon build --packages-select ros2_medkit_gateway clean
  • colcon test --packages-select ros2_medkit_gateway --ctest-args -LE "linter|integration":
    2044 tests / 0 failures
  • test_fault_manager (18 cases) passes - covers the previously racy
    get_snapshots / get_rosbag paths and the namespaced variants
  • pre-commit hooks pass (clang-format, ament-copyright, no-naked-subscriptions,
    clang-tidy on changed files)
  • TSan was not run locally (WSL2 ASLR cannot be relaxed for the TSan runtime);
    the CI sanitiser jobs will validate the race is gone

The callback group is created once at transport construction (single-threaded)
and stays on the transport's private executor. The rcl hash-map race that
check_no_naked_subscriptions.sh guards against does not apply, so the file
is added to ALLOWED_LEGACY_FILES with a note referencing this issue.

Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

Ros2FaultServiceTransport::get_snapshots / get_rosbag (and the other RPC
methods) previously called rclcpp::Client::async_send_request and blocked
on the std::future. That required some external executor to be spinning
the host node, and TSan caught a race between the spin thread cleaning up
the client's pending request entry (execute_client) and the calling thread
destroying the local response shared_ptr when the function returned.

Give the transport its own MutuallyExclusive callback group, registered
on a private SingleThreadedExecutor, and drive each RPC via
executor_.spin_until_future_complete(). The client callback execution and
the response destruction now happen sequentially on the caller's thread,
which removes the cross-thread race entirely.

The seven per-method mutexes collapse into one executor mutex - the
underlying SingleThreadedExecutor is not safe for concurrent
spin_until_future_complete callers, so all RPCs serialise through it. The
host node's executor no longer needs to know about these clients.

The new callback group is created once at construction (single-threaded)
and stays on the transport's private executor, so the rcl hash-map race
that the no-naked-subscriptions gate guards against does not apply. Add
the file to ALLOWED_LEGACY_FILES with a note pointing back at this issue.

Closes #399
Copilot AI review requested due to automatic review settings May 15, 2026 07:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors Ros2FaultServiceTransport to remove the dependency on an externally-spun node executor by driving fault-manager service clients via a transport-owned SingleThreadedExecutor and synchronous spin_until_future_complete(), addressing the TSan race described in issue #399.

Changes:

  • Create all 7 fault-manager service clients in a dedicated MutuallyExclusive callback group that is not auto-attached to the host node executor.
  • Add a private SingleThreadedExecutor in Ros2FaultServiceTransport and serialize all RPCs through an executor_mutex_ while synchronously spinning futures.
  • Allowlist the transport source file in check_no_naked_subscriptions.sh due to its create_callback_group() usage (with issue reference).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/ros2_medkit_gateway/src/ros2/transports/ros2_fault_service_transport.cpp Moves fault-manager RPC execution to a private executor + synchronous spinning; collapses per-method locks into a single executor lock.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2/transports/ros2_fault_service_transport.hpp Updates transport documentation and adds members for callback group, private executor, and executor mutex.
scripts/check_no_naked_subscriptions.sh Adds the transport file to the allowlist to account for intentional callback-group creation (issue #399).

Comment thread src/ros2_medkit_gateway/src/ros2/transports/ros2_fault_service_transport.cpp Outdated
Comment thread scripts/check_no_naked_subscriptions.sh
@bburda bburda self-assigned this May 15, 2026
Build the seven fault service clients on a private rclcpp::Node owned by
Ros2FaultServiceTransport instead of a dedicated callback group on the host
node. The private node is driven by the transport's own SingleThreadedExecutor
via spin_until_future_complete(), so the host gateway node's executor never
processes these clients and the client callback / response destruction stay
on the caller's thread.

This avoids the create_client(name, rclcpp::QoS, group) overload, which does
not exist on Humble (rclcpp there only ships the rmw_qos_profile_t form), and
keeps the no-naked-subscriptions regression gate satisfied without an
allowlist entry, since create_client and create_node are not gated APIs.

All seven RPC methods now share one consistent shape: build the request
unlocked, then under executor_mutex_ wait for the service, send, spin to
completion and take the response; response translation happens unlocked.

Adds a regression test that completes a get_snapshots RPC while the host
node is never spun, proving the transport drives its private client itself.
@bburda bburda requested a review from mfaferek93 May 15, 2026 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor Ros2FaultServiceTransport to use spin_until_future_complete to remove client/spin-thread race

2 participants