Skip to content

Conversation

@garlick
Copy link
Member

@garlick garlick commented Dec 2, 2025

Problem: #7109 moved all event publishing from the broker to the overlay subsystem so that the peer multicast portion could eventually run in separate thread. Unfortunately, moving all event publication machinery to overlay creates an unnecessary requirement that overlay be running on a size=1 instance, and will complicate flux module reload overlay, should that be useful.

Leave the resource intensive event mcast code in overlay but move the rest of it back into the broker.

TL;DR

Recall there are two ways that event messages are routed per RFC 3:

  1. A bare event message is forwarded upstream on the TBON and published when it arrives on rank 0.

  2. An event message is base64 encoded and encapsulated in a request message that is sent to rank 0, where it is published.

Publication occurs only on rank 0 and consists of assigning a monotonically increasing sequence number, distributing the message to local broker and module subscribers, and sending the event message to the overlay via the interthread message channel for further distribution.

The overlay now routes events as follows:

  • If received from the local broker on the interthread channel: On rank 0, messages are mcast to all children. On rank > 0, messages are forwarded to the parent for publication.

  • If received from an overlay child: On rank 0, messages are forwarded to the local broker for publication. On rank > 0, message are forwarded to the parent for publication

  • If received from the overlay parent (rank > 0), messages are mcast to all children AND sent to the local broker on the interthread channel for distribution to local broker and module subscribers.

Update the overlay unit test that were exercising full event publication. Update some event sharness tests that used the other RPC name.

Problem: flux-framework#7109 moved all event publishing
from the broker to the overlay subsystem so that the peer multicast
portion could eventually run in separate thread.  Unfortunately,
moving all event publication machinery to overlay creates an
unnecessary requirement that overlay be running on a size=1 instance,
and will complicate 'flux module reload overlay', should that be
useful.

Leave the resource intensive event mcast code in overlay but
move the rest of it back into the broker.

TL;DR

Recall there are two ways that event messages are routed per RFC 3:

1) A bare event message is forwarded upstream on the TBON and published
when it arrives on rank 0.

2) An event message is base64 encoded and encapsulated in a request
message that is sent to rank 0, where it is published.

Publication occurs only on rank 0 and consists of assigning a monotonically
increasing sequence number, distributing the message to local broker
and module subscribers, and sending the event message to the overlay via
the interthread message channel for further distribution.

The overlay now routes events as follows:

- If received from the local broker on the interthread channel:
  On rank 0, messages are mcast to all children.
  On rank > 0, messages are forwarded to the parent for publication.

- If received from an overlay child:
  On rank 0, messages are forwarded to the local broker for publication.
  On rank > 0, message are forwarded to the parent for publication

- If received from the overlay parent (rank > 0), messages are mcast
  to all children AND sent to the local broker on the interthread channel
  for distribution to local broker and module subscribers.

Update the overlay unit test that were exercising full event publication.
Update some event sharness tests that used the other RPC name.
@garlick
Copy link
Member Author

garlick commented Dec 12, 2025

Rebased on current master.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@garlick
Copy link
Member Author

garlick commented Dec 12, 2025

Thanks I'll set MWP

@mergify mergify bot added the queued label Dec 12, 2025
@mergify mergify bot merged commit ed7fdfd into flux-framework:master Dec 12, 2025
34 of 35 checks passed
@mergify
Copy link
Contributor

mergify bot commented Dec 12, 2025

Merge Queue Status

✅ The pull request has been merged at b504473

This pull request spent 6 seconds in the queue, with no time running CI.
The checks were run in-place.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = validate commits
    • check-neutral = validate commits
    • check-skipped = validate commits
  • any of [🛡 GitHub branch protection]:
    • check-success = address-sanitizer check
    • check-neutral = address-sanitizer check
    • check-skipped = address-sanitizer check
  • any of [🛡 GitHub branch protection]:
    • check-success = coverage
    • check-neutral = coverage
    • check-skipped = coverage
  • any of [🛡 GitHub branch protection]:
    • check-success = focal - py3.8
    • check-neutral = focal - py3.8
    • check-skipped = focal - py3.8
  • any of [🛡 GitHub branch protection]:
    • check-success = docs/readthedocs.org:flux-core
    • check-neutral = docs/readthedocs.org:flux-core
    • check-skipped = docs/readthedocs.org:flux-core
  • any of [🛡 GitHub branch protection]:
    • check-success = inception
    • check-neutral = inception
    • check-skipped = inception
  • any of [🛡 GitHub branch protection]:
    • check-success = flux-sched check
    • check-neutral = flux-sched check
    • check-skipped = flux-sched check
  • any of [🛡 GitHub branch protection]:
    • check-success = el8 - system,coverage
    • check-neutral = el8 - system,coverage
    • check-skipped = el8 - system,coverage
  • any of [🛡 GitHub branch protection]:
    • check-success = spelling
    • check-neutral = spelling
    • check-skipped = spelling
  • any of [🛡 GitHub branch protection]:
    • check-success = el8 - ascii
    • check-neutral = el8 - ascii
    • check-skipped = el8 - ascii
  • any of [🛡 GitHub branch protection]:
    • check-success = bookworm - 32 bit
    • check-neutral = bookworm - 32 bit
    • check-skipped = bookworm - 32 bit
  • any of [🛡 GitHub branch protection]:
    • check-success = flux-accounting check
    • check-neutral = flux-accounting check
    • check-skipped = flux-accounting check
  • any of [🛡 GitHub branch protection]:
    • check-success = python linting
    • check-neutral = python linting
    • check-skipped = python linting
  • any of [🛡 GitHub branch protection]:
    • check-success = el9 - test-install
    • check-neutral = el9 - test-install
    • check-skipped = el9 - test-install
  • any of [🛡 GitHub branch protection]:
    • check-success = fedora40 - clang-18
    • check-neutral = fedora40 - clang-18
    • check-skipped = fedora40 - clang-18
  • any of [🛡 GitHub branch protection]:
    • check-success = noble - test-install
    • check-neutral = noble - test-install
    • check-skipped = noble - test-install
  • any of [🛡 GitHub branch protection]:
    • check-success = alpine - test-install
    • check-neutral = alpine - test-install
    • check-skipped = alpine - test-install
  • any of [🛡 GitHub branch protection]:
    • check-success = fedora40 - test-install
    • check-neutral = fedora40 - test-install
    • check-skipped = fedora40 - test-install
  • any of [🛡 GitHub branch protection]:
    • check-success = bookworm - test-install
    • check-neutral = bookworm - test-install
    • check-skipped = bookworm - test-install
  • any of [🛡 GitHub branch protection]:
    • check-success = jammy - test-install
    • check-neutral = jammy - test-install
    • check-skipped = jammy - test-install
  • any of [🛡 GitHub branch protection]:
    • check-success = flux-pam check
    • check-neutral = flux-pam check
    • check-skipped = flux-pam check
  • any of [🛡 GitHub branch protection]:
    • check-success = flux-pmix check
    • check-neutral = flux-pmix check
    • check-skipped = flux-pmix check
  • any of [🛡 GitHub branch protection]:
    • check-success = bookworm - gcc-12,distcheck
    • check-neutral = bookworm - gcc-12,distcheck
    • check-skipped = bookworm - gcc-12,distcheck

@mergify mergify bot removed the queued label Dec 12, 2025
@garlick garlick deleted the event_publish branch December 12, 2025 16:14
@codecov
Copy link

codecov bot commented Dec 12, 2025

Codecov Report

❌ Patch coverage is 78.26087% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.66%. Comparing base (87afa84) to head (b504473).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
src/broker/broker.c 77.14% 24 Missing ⚠️
src/broker/overlay.c 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7221      +/-   ##
==========================================
- Coverage   83.66%   83.66%   -0.01%     
==========================================
  Files         554      554              
  Lines       92625    92646      +21     
==========================================
+ Hits        77496    77509      +13     
- Misses      15129    15137       +8     
Files with missing lines Coverage Δ
src/broker/modhash.c 76.78% <100.00%> (ø)
src/common/libflux/event.c 77.66% <ø> (ø)
src/broker/overlay.c 79.46% <88.88%> (-0.14%) ⬇️
src/broker/broker.c 76.01% <77.14%> (+0.53%) ⬆️

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants