
feat(gateway): add GET /apps/{id}/belongs-to discovery endpoint#398

Merged
bburda merged 16 commits into main from feat/issue-196-app-belongs-to
May 14, 2026

Conversation

Collaborator

@bburda bburda commented May 13, 2026

Summary

Implements GET /api/v1/apps/{app-id}/belongs-to. Returns the area that
contains an app, resolved via the app's parent component and that
component's area assignment.

The belongs-to resource was already listed in app capabilities; this PR
wires it up with a real handler, route, capability URI on GET /apps/{id},
unit + integration tests, and the matching REQ_INTEROP_106 spec entry.

Pattern mirrors the existing relationship endpoints (is-located-on,
depends-on): collection wrapper with 0 or 1 element, x-medkit.missing=true
on the item when an area id cannot be resolved.
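A minimal sketch of that wrapper shape, assuming the handler serializes the body by hand; the helper name and parameters here are illustrative, not the gateway's actual API:

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical sketch of the 0-or-1 element collection wrapper described
// above: a resolved area yields one item, an unresolvable parent component
// yields one item flagged x-medkit.missing=true, and no parent at all
// yields an empty collection.
std::string build_belongs_to_body(const std::optional<std::string>& area_id,
                                  const std::optional<std::string>& dangling_component) {
  std::ostringstream out;
  out << "{\"items\":[";
  if (area_id) {
    out << "{\"id\":\"" << *area_id << "\"}";
  } else if (dangling_component) {
    // Unresolvable parent: emit a placeholder item flagged as missing,
    // carrying the dangling component id for diagnosis.
    out << "{\"x-medkit\":{\"missing\":true,"
        << "\"component_id\":\"" << *dangling_component << "\"}}";
  }
  // No parent component at all -> legitimately empty collection.
  out << "]}";
  return out.str();
}
```

The empty-collection and missing-item cases stay distinguishable to HATEOAS clients without changing the response's outer shape.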


Issue


Type

  • Bug fix
  • New feature or tests
  • Breaking change
  • Documentation only

Testing

  • Unit: 6 new cases in test_discovery_handlers.cpp covering happy path,
    no-component, no-area, missing-area, invalid id, unknown app. Plus
    belongs-to URI assertion in the existing GetAppReturnsLinksAndCapabilities
    and supports_resource("belongs-to") for the APP capability set.
  • Integration: 2 new cases in test_hateoas.test.py (happy path against
    demo manifest, 404 for unknown app).
  • Full local unit suite (2130 tests): green.
  • Linters (clang-format, ament_copyright, cpplint, etc.) + clang-tidy on
    changed files: clean.
  • Sphinx docs build: clean apart from the unrelated auto-generated
    verification.rst path.

Also bumps the TriggerManagerTest.LifetimeExpiry sleep margin in a
separate commit. The test created a 1 s trigger and slept 1500 ms; under
parallel CTest load the 500 ms margin past expiry was occasionally short
and the trigger still reported active. Bumped to 2500 ms.


Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

Copilot AI review requested due to automatic review settings May 13, 2026 12:15
Contributor

Copilot AI left a comment


Pull request overview

Adds a new discovery relationship endpoint GET /api/v1/apps/{app-id}/belongs-to to resolve an App’s parent Area via its host Component, and wires it through routing, docs, and tests in the gateway.

Changes:

  • Implement DiscoveryHandlers::handle_app_belongs_to() and register the new route in RESTServer::setup_routes().
  • Extend discovery/HATEOAS expectations with unit + integration tests, and assert APP capability support for belongs-to.
  • Update REST API docs and requirements/spec traceability for REQ_INTEROP_106; adjust a flaky trigger-manager timing test.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

  • src/ros2_medkit_integration_tests/test/features/test_hateoas.test.py: Adds integration coverage for /apps/{id}/belongs-to (happy path + 404).
  • src/ros2_medkit_gateway/test/test_trigger_manager.cpp: Increases the sleep margin in LifetimeExpiry to reduce CI flakiness.
  • src/ros2_medkit_gateway/test/test_entity_resource_model.cpp: Adds an assertion that APP capabilities include belongs-to.
  • src/ros2_medkit_gateway/test/test_discovery_handlers.cpp: Adds unit coverage for the new belongs-to handler and updates app detail expectations.
  • src/ros2_medkit_gateway/src/http/rest_server.cpp: Registers the /belongs-to route for apps (OpenAPI + handler wiring).
  • src/ros2_medkit_gateway/src/http/handlers/discovery_handlers.cpp: Implements the belongs-to handler and exposes the resource URI from GET /apps/{id}.
  • src/ros2_medkit_gateway/include/ros2_medkit_gateway/core/http/handlers/discovery_handlers.hpp: Declares the new handler entrypoint in DiscoveryHandlers.
  • docs/requirements/specs/discovery.rst: Adds REQ_INTEROP_106 requirement entry for the new endpoint.
  • docs/api/rest.rst: Documents GET /api/v1/apps/{app_id}/belongs-to behavior and example response.

Comment thread src/ros2_medkit_gateway/test/test_trigger_manager.cpp Outdated
Comment thread src/ros2_medkit_gateway/test/test_entity_resource_model.cpp
@bburda bburda self-assigned this May 13, 2026
@bburda bburda requested a review from Copilot May 13, 2026 15:44
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment thread src/ros2_medkit_gateway/src/http/handlers/discovery_handlers.cpp
@bburda bburda requested a review from mfaferek93 May 13, 2026 17:51
bburda added a commit that referenced this pull request May 14, 2026
…head

The gateway shutdown sequence (mdns stop, REST server stop, transport
teardown, plugin manager shutdown, member destructors) completes well
under a second normally, but under TSan/ASan overhead on CI runners
it can exceed the launch_ros defaults (5s sigterm_timeout / 5s
sigkill_timeout). When that happens launch_testing escalates
SIGINT -> SIGTERM -> SIGKILL and the TestShutdown.test_exit_codes
assertion reports exit code -9 even though the gateway is mid-shutdown.

The TSan run for PR #398 hit exactly this on
test_scenario_thermal_protection: the gateway logged TriggerTopicSubscriber
shutdown complete and other steps but was still cleaning up when the
SIGKILL timer fired.

Set sigterm_timeout=30 and sigkill_timeout=15 in create_gateway_node so
every integration test that goes through this factory gets the longer
windows. The bigger numbers do not slow down healthy runs since the
process exits as soon as shutdown actually finishes.
Collaborator

@mfaferek93 mfaferek93 left a comment


A few comments :)

Comment thread src/ros2_medkit_gateway/src/http/rest_server.cpp
Comment thread src/ros2_medkit_gateway/src/http/handlers/discovery_handlers.cpp
Comment thread src/ros2_medkit_gateway/src/http/handlers/discovery_handlers.cpp
Comment thread src/ros2_medkit_gateway/src/http/handlers/discovery_handlers.cpp
Comment thread src/ros2_medkit_gateway/test/test_discovery_handlers.cpp
Comment thread src/ros2_medkit_gateway/src/http/rest_server.cpp
Comment thread src/ros2_medkit_gateway/src/http/handlers/discovery_handlers.cpp Outdated
Comment thread src/ros2_medkit_gateway/src/http/handlers/discovery_handlers.cpp Outdated
Comment thread src/ros2_medkit_gateway/test/test_trigger_manager.cpp
bburda added 10 commits May 14, 2026 17:28
Returns the area that contains an app, resolved transitively via the
app's parent component and that component's area assignment. Follows
the existing relationship-endpoint pattern (is-located-on, depends-on):
collection response wrapped in {"items": [...]} with 0 or 1 element.

The "belongs-to" resource was already declared in app capabilities; this
wires it through with a real handler, route registration, capability URI
on the app detail response, unit + integration tests, and REQ_INTEROP_106.

closes #196
Replace the fixed sleep in TriggerManagerTest.LifetimeExpiry with a
bounded polling loop. The test now exits as soon as is_active flips to
false on a fast machine (~1 s) and only sits up to the 5 s deadline if
the scheduler stalls under heavy parallel test load.
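The bounded-polling pattern described above can be sketched as follows; the helper name, 5 s deadline, and 50 ms interval are illustrative defaults, not the test's actual constants:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Return true as soon as the predicate reports inactive; give up only at
// the deadline. A fast machine exits on the first successful poll instead
// of sitting out a fixed worst-case sleep.
bool wait_until_inactive(const std::function<bool()>& is_active,
                         std::chrono::milliseconds deadline = std::chrono::milliseconds(5000),
                         std::chrono::milliseconds interval = std::chrono::milliseconds(50)) {
  const auto stop_at = std::chrono::steady_clock::now() + deadline;
  while (std::chrono::steady_clock::now() < stop_at) {
    if (!is_active()) return true;  // trigger expired -> done
    std::this_thread::sleep_for(interval);
  }
  return !is_active();  // one final check at the deadline
}
```

Unlike a fixed sleep, the only run that pays the full deadline is one where the scheduler genuinely stalls past it.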

Also split AppSupportsIsLocatedOn into two cases so the name reflects
what each test asserts: AppSupportsIsLocatedOn and AppSupportsBelongsTo.
…lity

handle_get_app already added the top-level "belongs-to" URI when the app
has a parent component but did not also add it to the _links object.
That left the new relationship discoverable from the top-level field but
not from _links, while is-located-on and depends-on appear in both.

Mirror the existing pattern: gate the _links entry on the same
component_id condition and assert it in the unit test.
FaultManagerTest cases that called start_spinning() and then invoked
fault_manager.get_snapshots() or get_rosbag() did not join the spin
thread before letting test-scope locals run their destructors. Under
TSan this produced flaky "data race in operator delete" warnings where
the executor thread cleaning up the rclcpp client's pending request
overlapped with destruction of the result and local fixtures in the
main thread.

Add stop_spinning() right after each get_*() call so the spin thread
is cancelled and joined before the test reads from result or anything
goes out of scope. Mirrors the pattern already used by the notify and
trigger-event tests in this file.
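The join-before-teardown pattern above can be sketched like this; SpinFixture and its method names are illustrative stand-ins for the test fixture, with a plain loop in place of the rclcpp executor:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// The essential property: stop_spinning() cancels and joins the worker, so
// by the time test-scope locals are destroyed no other thread can still be
// touching them. The destructor is a safety net for early test exits.
class SpinFixture {
 public:
  void start_spinning() {
    running_.store(true);
    worker_ = std::thread([this] {
      while (running_.load()) {
        // stand-in for executor_.spin_some();
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }
  void stop_spinning() {
    running_.store(false);
    if (worker_.joinable()) worker_.join();
  }
  bool spinning() const { return running_.load(); }
  ~SpinFixture() { stop_spinning(); }
 private:
  std::atomic<bool> running_{false};
  std::thread worker_;
};
```

Calling stop_spinning() right after the last get_*() call, as the commit does, guarantees the ordering TSan was complaining about.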
The "Upload coverage to Codecov" step runs in an ubuntu:noble container
that does not include gpg. codecov/codecov-action@v5 runs a dependency
check that requires gpg and exits with code 1 when it is missing,
failing the coverage job on every push to main since the step is gated
on github.ref == 'refs/heads/main'. Add gpg to the apt-get install
line so the action can validate uploader signatures.
…head

The gateway shutdown sequence (mdns stop, REST server stop, transport
teardown, plugin manager shutdown, member destructors) completes well
under a second normally, but under TSan/ASan overhead on CI runners
it can exceed the launch_ros defaults (5s sigterm_timeout / 5s
sigkill_timeout). When that happens launch_testing escalates
SIGINT -> SIGTERM -> SIGKILL and the TestShutdown.test_exit_codes
assertion reports exit code -9 even though the gateway is mid-shutdown.

The TSan run for PR #398 hit exactly this on
test_scenario_thermal_protection: the gateway logged TriggerTopicSubscriber
shutdown complete and other steps but was still cleaning up when the
SIGKILL timer fired.

Set sigterm_timeout=30 and sigkill_timeout=15 in create_gateway_node so
every integration test that goes through this factory gets the longer
windows. The bigger numbers do not slow down healthy runs since the
process exits as soon as shutdown actually finishes.
Three gateway plugins held rclcpp::Node or rclcpp::Subscription members
and called shutdown() from a non-noexcept destructor. On Rolling rclcpp
the destructor of those resources throws graph_listener::NodeNotFoundError
when the rclcpp context has already been invalidated by rclcpp::shutdown()
(which is exactly the situation during late teardown). An exception
escaping a destructor invokes std::terminate(), which is what
test_beacon_param TestShutdown.test_exit_codes saw as exit code -6 on
the rolling job.

Plugin_manager::shutdown_all() already catches exceptions from
plugin->shutdown(), but that does not help when the plugin destructor
itself runs the throw (either by calling shutdown() again or by member
auto-destruction once the explicit reset() failed).

Fix in three plugins (parameter_beacon, topic_beacon, graph_provider):

1. Mark the destructor noexcept and wrap the shutdown() call in
   try-catch so nothing escapes the destructor.
2. Wrap the rclcpp resource reset() inside shutdown() in try-catch so
   the first shutdown attempt (driven by plugin_manager) drops the
   resource even when rclcpp throws on tear-down.

Only the destruction sites were touched; the rest of shutdown() and the
normal happy path are unchanged.
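The two-layer guard can be sketched as below; BeaconPlugin and the injected teardown callback are illustrative stand-ins for the plugins and their rclcpp members:

```cpp
#include <functional>
#include <stdexcept>
#include <utility>

// Layer 1: shutdown() catches exceptions from the resource teardown, so the
// first, plugin_manager-driven call still drops the resource even when
// rclcpp throws during late teardown. Layer 2: the destructor is noexcept
// and wraps its own shutdown() call, so nothing can escape a destructor and
// reach std::terminate().
class BeaconPlugin {
 public:
  explicit BeaconPlugin(std::function<void()> teardown)
      : teardown_resource_(std::move(teardown)) {}

  void shutdown() {
    if (shut_down_) return;
    try {
      teardown_resource_();  // may throw once the rclcpp context is invalid
    } catch (...) {
      // The resource is considered dropped either way; log in a real plugin.
    }
    shut_down_ = true;
  }

  ~BeaconPlugin() noexcept {
    try { shutdown(); } catch (...) {}  // nothing may escape a destructor
  }

  bool is_shut_down() const { return shut_down_; }

 private:
  std::function<void()> teardown_resource_;
  bool shut_down_ = false;
};
```

A plugin whose destructor runs without a prior shutdown() call now swallows the throw instead of aborting the process with SIGABRT.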
auth_config.cpp had no entries for apps/*/belongs-to or apps/*/is-located-on
under any of viewer / operator / configurator. The apps/* wildcard in the
catch-all entry is single-segment (auth_requirement_policy maps * to [^/]+),
so it does not cover the relationship subresources. Only admin had a
matching apps/** fallback, which meant every other role hit 403 on these
endpoints.

Add the two new entries for each role next to the existing apps/*/depends-on
entry. is-located-on is pre-existing - this also fixes the same gap that
shipped with REQ_INTEROP_105.
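Why the single-segment wildcard misses the subresources can be demonstrated in isolation; the pattern string here mirrors the described mapping of * to [^/]+ and is illustrative, not the actual auth_config entry:

```cpp
#include <regex>
#include <string>

// apps/* with * mapped to the single-segment pattern [^/]+ matches
// apps/{id} but cannot cross a slash, so apps/{id}/belongs-to falls
// through to whatever fallback entry exists (only admin's apps/**).
bool matches_single_segment_wildcard(const std::string& path) {
  static const std::regex rule("apps/[^/]+");
  return std::regex_match(path, rule);
}
```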
handle_app_belongs_to silently returned items=[] when the app's
component_id pointed to a component the discovery layer could not resolve.
HATEOAS clients then could not tell 'app legitimately has no parent area'
from 'manifest broken, parent component gone'.

Add the same symmetric branch handle_app_is_located_on already has:
emit a single item with x-medkit.missing=true and an x-medkit
extension field carrying the dangling component id, plus a RCLCPP_WARN.
A component that resolves but has no area_id assigned still produces
items=[] - that is a valid manifest configuration, not a broken ref.

Also tighten the chain traversal with an atomic ThreadSafeEntityCache
snapshot so the three lookups (app, component, area) cannot observe
mixed cache generations under a concurrent refresh, and switch the
three app relationship handlers to validate_entity_for_route so remote
apps in an aggregation cluster get forwarded instead of 404.

Mirror unit test AppIsLocatedOnReturnsMissingItemWhenHostComponentUnresolved.
…shot

GET /apps/{id} exposes the available subresources on three parallel
surfaces (top-level URI references, _links, and the capabilities array);
the belongs-to addition only updated two of them. App::to_capabilities,
used by notifications / snapshots / change events, also lacked the link.
Brought all four sites into agreement by adding BELONGS_TO and
IS_LOCATED_ON entries to CapabilityBuilder::Capability (IS_LOCATED_ON
was already missing - boy-scouted along with the new one), pushing them
into the caps vector under the same component_id gate as the existing
URI / _links emission, and adding belongs-to to App::to_capabilities.

SOVD ISO 17978-3 §7.6 defines /belongs-to only for /apps. The previous
EntityCapabilities table advertised belongs-to for SERVER and COMPONENT
too, so supports_resource("belongs-to") returned true on those types and
clients hit 404 when following the advertised resource. Dropped both
entries from resources_; SERVER and COMPONENT keep their other
relationship resources.

Add an atomic ThreadSafeEntityCache::get_app_with_links() that resolves
an App together with its parent Component and Area under a single
shared_lock. Three sequential get_* calls each take a fresh shared_lock,
which lets a writer refresh advance the generation between them and
return a mixed-generation view (app from gen N, component from N+1,
area from N). The handler refactor is in the previous commit; this one
adds the method and its unit test coverage via the handler suite.
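The single-lock snapshot can be sketched as follows; the entity structs, member names, and std::map storage are simplified stand-ins for ThreadSafeEntityCache, not its real layout:

```cpp
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <tuple>

struct App { std::string component_id; };
struct Component { std::string area_id; };
struct Area { std::string name; };

// Resolve app -> parent component -> area under ONE shared_lock, so a
// concurrent writer refresh cannot advance the cache generation between
// the three lookups and hand back a mixed-generation view.
class EntityCache {
 public:
  std::tuple<std::optional<App>, std::optional<Component>, std::optional<Area>>
  get_app_with_links(const std::string& app_id) const {
    std::shared_lock lock(mutex_);  // one lock for the whole chain
    auto a = apps_.find(app_id);
    if (a == apps_.end()) return {};
    std::optional<Component> comp;
    std::optional<Area> area;
    if (auto c = components_.find(a->second.component_id); c != components_.end()) {
      comp = c->second;
      if (auto r = areas_.find(c->second.area_id); r != areas_.end()) area = r->second;
    }
    return {a->second, comp, area};
  }

  void insert(std::string id, App app) {
    std::unique_lock lock(mutex_);
    apps_[std::move(id)] = std::move(app);
  }
  void insert(std::string id, Component c) {
    std::unique_lock lock(mutex_);
    components_[std::move(id)] = std::move(c);
  }
  void insert(std::string id, Area a) {
    std::unique_lock lock(mutex_);
    areas_[std::move(id)] = std::move(a);
  }

 private:
  mutable std::shared_mutex mutex_;
  std::map<std::string, App> apps_;
  std::map<std::string, Component> components_;
  std::map<std::string, Area> areas_;
};
```

Readers still run concurrently with each other; only a writer refresh has to wait for in-flight chain traversals to release the shared lock.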

Document on docs/api/rest.rst that standalone apps deliberately do not
expose the belongs-to URI - SOVD §7.6 says "Each URI reference is only
present if the entity supports that collection". HATEOAS-only clients
must rely on absence of the link.

Switch test_hateoas integration tests for the 200 paths to
self.poll_endpoint(), which adds retry under sanitizer / CI load instead
of the raw requests.get(timeout=10) used before. Update the 404 shape to
the generic 'Entity not found' / parameters.entity_id payload emitted by
validate_entity_for_route (the relationship handlers now go through it
so remote apps in an aggregation cluster get forwarded).
@bburda bburda force-pushed the feat/issue-196-app-belongs-to branch from eb3ae1c to 97c654c Compare May 14, 2026 15:41
bburda added 6 commits May 14, 2026 18:12
The :ref:`sovd-api-discovery` target lives in the separate knowledge/
Sphinx project and is not reachable from ros2_medkit's docs build, so
sphinx-build -W rejected it as an undefined label. Keep the ISO 17978-3
§7.6 citation in prose; consumers can look the section up directly.
… cases

handle_app_depends_on iterates app.depends_on and resolves each entry
via cache.get_app(). Each call takes a fresh shared_lock, which lets a
writer refresh advance the cache generation between the app fetch and
the N per-dependency fetches - a 5-dependency app could return 5
dependencies from 5 different generations.

Add ThreadSafeEntityCache::get_app_with_dependencies() that resolves the
app plus every entry in its depends_on under a single shared_lock, and
switch the handler to use it. Discovery-manager fallback preserves the
old shape (no atomic guarantee, but rare path used when the entity isn't
cached yet).

Seed the demo manifest with four HATEOAS edge-case fixtures
(hateoas-edge-area, hateoas-component-no-area, hateoas-app-standalone,
hateoas-app-on-area-less-component) and add the matching integration
tests so the standalone-app and component-without-area branches of
/belongs-to and /is-located-on are exercised end-to-end, not just in
unit tests.

The 'unresolved component' and 'dangling area' branches are not seeded
in the integration manifest because validators R007 (app -> missing
component) and R006 (component -> missing area) reject the manifest and
trigger a fallback to runtime_only discovery, which disables the rest
of the suite. Both branches remain covered by the unit tests
AppBelongsToReturnsMissingItemWhenParentComponentUnresolved and
AppBelongsToReturnsMissingItemWhenAreaUnresolved.
… count assertions

The new hateoas-edge-area, hateoas-component-no-area, hateoas-app-standalone
and hateoas-app-on-area-less-component manifest entries (added to support
the standalone-app / area-less integration tests in test_hateoas) tripped
the exact-count assertions in test_hybrid_suppression and the
allow-list-style checks in test_discovery_gap_fill. Both suites load the
same demo_nodes_manifest.yaml.

Extend MANIFEST_ALL_AREAS, MANIFEST_TOP_LEVEL_AREAS, MANIFEST_COMPONENTS
and MANIFEST_APPS in test_hybrid_suppression to include the new ids, plus
add MANIFEST_APPS_WITH_BINDING so the test_manifest_apps_are_online check
skips the manifest-only fixtures (no ros_binding means they never reach
is_online=True). Mirror the allow-list updates in test_discovery_gap_fill.
…fixtures

test_scenario_discovery_manifest::test_25_no_topic_components_in_manifest_only
hard-codes the set of expected component IDs and rejects any others, so
the hateoas-component-no-area fixture (added in dfc4dae to support the
new integration tests for the standalone / area-less branches) tripped
it under manifest_only mode. Add the fixture id to the allow-list.
ASan job hit 'ld: final link failed: No space left on device' three
runs in a row on the c220212 CI cycle, deterministically aborting at
the test_plugin_entity_routing / test_plugin_manager link step. The
github-hosted runner gives the container ~14 GB on /, and the
combination of Debug (-O0 -g) plus ASan + UBSan instrumentation
produces enough object-file/debug-info bloat to exhaust that even with
ccache warm.

- Switch -DCMAKE_BUILD_TYPE=Debug to -DCMAKE_BUILD_TYPE=RelWithDebInfo
  for sanitizer-asan, matching what sanitizer-tsan already uses. The
  ROS2MedkitSanitizers cmake module overrides optimisation to -O1 in
  both cases, so this only drops the -O0 expansion and keeps debug
  info. ASan/UBSan stack traces stay readable.
- Extend the Free disk space step with /var/cache/apt, /var/lib/apt/lists,
  /usr/share/doc, /usr/share/man, /usr/share/locale cleanup inside the
  container, plus df -h before and after the build for diagnostics.

No source change - the test/build behaviour is preserved; only the
disk footprint shrinks.
…apt install

The aggressive disk cleanup in the prior commit rm'd /usr/share/doc,
/usr/share/man and /usr/share/locale. dpkg postinstall scripts for
ros-jazzy-desktop, openjdk-21, libvtk9 and libpcl-dev fail when those
trees are missing - the ASan job aborted after 5 min in 'Set up ROS 2
Jazzy' with dpkg returning 100 on every dependency.

Roll those three rm -rf paths back. Keep the RelWithDebInfo build-type
change (which was the actual fix for the ASan disk-space failure) and
the df -h diagnostic.
@mfaferek93 mfaferek93 self-requested a review May 14, 2026 20:49
@bburda bburda merged commit aeac123 into main May 14, 2026
14 checks passed
@bburda bburda deleted the feat/issue-196-app-belongs-to branch May 14, 2026 20:50


Development

Successfully merging this pull request may close these issues.

Implement GET /apps/{app-id}/belongs-to endpoint

3 participants