Conversation

@tstamler
Contributor

@tstamler tstamler commented Nov 10, 2025

What?

Add a heartbeat mechanism to NIXL ETCD metadata to invalidate stale metadata if an agent dies.

When an agent uploads metadata to ETCD, it attaches that data to a lease instead of storing it permanently. A heartbeat thread is then spawned to keep renewing the lease for as long as the agent exists. If the lease is not renewed, it expires and the metadata is removed, which automatically tells any NIXL agents watching that key to remove the agent.
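
To make the lease semantics above concrete, here is a minimal, self-contained C++ sketch. It is an in-memory stand-in, not the etcd client API: `LeasedStore`, `grant`, `keepAlive`, `put`, and `get` are hypothetical names, and time is passed in explicitly as integer seconds so the behavior is deterministic. It only illustrates the idea that a key attached to a lease disappears once renewals stop.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical in-memory model of a key-value store with leases.
// This is NOT the etcd API; it only mimics lease expiry for illustration.
class LeasedStore {
public:
    // Grant a lease that expires `ttl` seconds after `now`.
    std::int64_t grant(std::int64_t now, std::int64_t ttl) {
        assert(ttl > 0);
        const std::int64_t id = next_id_++;
        leases_[id] = Lease{ttl, now + ttl};
        return id;
    }

    // Heartbeat: push the deadline out by the lease TTL.
    // Returns false if the lease has already expired.
    bool keepAlive(std::int64_t now, std::int64_t id) {
        expire(now);
        auto it = leases_.find(id);
        if (it == leases_.end()) return false;
        it->second.deadline = now + it->second.ttl;
        return true;
    }

    // Store a value attached to a live lease.
    bool put(std::int64_t now, const std::string& key,
             const std::string& value, std::int64_t id) {
        expire(now);
        if (leases_.find(id) == leases_.end()) return false;
        data_[key] = Entry{value, id};
        return true;
    }

    // Returns a pointer to the value, or nullptr if the key is absent.
    // The pointer is only valid until the store is modified again.
    const std::string* get(std::int64_t now, const std::string& key) {
        expire(now);
        auto it = data_.find(key);
        return it == data_.end() ? nullptr : &it->second.value;
    }

private:
    struct Lease { std::int64_t ttl; std::int64_t deadline; };
    struct Entry { std::string value; std::int64_t lease; };

    // Drop every expired lease together with the keys attached to it.
    void expire(std::int64_t now) {
        for (auto l = leases_.begin(); l != leases_.end();) {
            if (l->second.deadline > now) { ++l; continue; }
            for (auto d = data_.begin(); d != data_.end();) {
                if (d->second.lease == l->first) d = data_.erase(d);
                else ++d;
            }
            l = leases_.erase(l);
        }
    }

    std::map<std::int64_t, Lease> leases_;
    std::map<std::string, Entry> data_;
    std::int64_t next_id_ = 1;
};
```

For example, with a 10-second lease granted at t=0 and renewed at t=5, a `get` at t=12 still finds the key; if no further renewal arrives, a `get` at t=15 returns `nullptr`.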

Why?

If a NIXL agent has uploaded metadata to ETCD and is unexpectedly killed (or the user exits the application without invalidating the metadata), that metadata remains in ETCD indefinitely and can confuse applications that rely on ETCD for agent coordination.

@copy-pr-bot

copy-pr-bot bot commented Nov 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi tstamler! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@tstamler tstamler marked this pull request as ready for review November 12, 2025 22:24
@tstamler tstamler requested a review from a team as a code owner November 12, 2025 22:24
@aranadive
Contributor

/ok to test c2065fd

@aranadive
Contributor

/build

aranadive previously approved these changes Nov 14, 2025
Contributor

@aranadive aranadive left a comment

LGTM.

@aranadive
Contributor

/ok to test 4e8f87d

@aranadive
Contributor

/build

@2dm
Contributor

2dm commented Nov 19, 2025

@tstamler I added the fixes we talked about.

@copy-pr-bot

copy-pr-bot bot commented Nov 19, 2025

/ok to test 4d6cdcd

@tstamler, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@tstamler
Contributor Author

/ok to test 5220637

@tstamler
Contributor Author

/build

@aranadive
Contributor

/build

@tstamler
Contributor Author

tstamler commented Dec 3, 2025

/build

@aranadive
Contributor

/ok to test 73195df

@tstamler
Contributor Author

tstamler commented Dec 4, 2025

/ok to test 62eea59

@aranadive
Contributor

/ok to test 179ecbd

@aranadive
Contributor

/build

  // Get current index to watch from
  etcd::Response response = etcd->get(metadata_key);
- int64_t watch_index = response.index();
+ int64_t watch_index = response.index() + 1;
Contributor

Could you please explain this change?

Contributor

Setting a watcher at the current index can "replay" events from that index.
For example, an agent is removed and then re-added, as can happen in an elastic use case.
The newly created watcher receives the delete event and closes.
Using index + 1, relative to the index at which the fetch failed, means the watcher is only notified of changes that happen from that point on.
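
The replay described above can be sketched with a small in-memory event log. This is an illustration under assumptions, not the etcd client API: `Event`, `EventType`, and `eventsFrom` are hypothetical names, and the sketch assumes that a watch's starting revision is inclusive.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class EventType { Put, Delete };

// One entry in a hypothetical revision-ordered event log.
struct Event {
    std::int64_t revision;
    EventType type;
    std::string key;
};

// Events a watcher on `key` would receive if it started at `from_revision`,
// assuming the starting revision is inclusive.
std::vector<Event> eventsFrom(const std::vector<Event>& log,
                              const std::string& key,
                              std::int64_t from_revision) {
    assert(from_revision >= 0);
    std::vector<Event> out;
    for (const Event& e : log) {
        if (e.key == key && e.revision >= from_revision) {
            out.push_back(e);
        }
    }
    return out;
}
```

With a log holding a put at revision 5, a delete at revision 8, and a re-add at revision 10, a watcher started at 8 receives the delete first, whereas one started at 9 receives only the re-add.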

void
startHeartbeatThread(uint64_t lease_id) {
while (!heartbeat_thread_stop) {
// keep alive for twice the heartbeat interval
Contributor

Could you please explain why 2x?

etcd::Response response = etcd->leasegrant((heartbeat.count()) * 2);

if (response.is_ok()) {

Contributor

Empty line?


NIXL_DEBUG << "Using etcd namespace for agents: " << namespace_prefix;

etcd::Response response = etcd->leasegrant((heartbeat.count()) * 2);
Contributor

Why 2x?
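
The thread does not record an answer to this question. One common rationale for a TTL larger than the renewal interval is slack for a late renewal; the arithmetic below sketches that assumption only, using a hypothetical helper `survivesLateRenewal` with all times in seconds.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the last successful renewal happens at t = 0 and sets
// the lease deadline to `ttl`. The next renewal is due at `interval` but
// arrives `delay` seconds late. Returns true if it arrives before expiry.
bool survivesLateRenewal(std::int64_t ttl, std::int64_t interval,
                         std::int64_t delay) {
    assert(ttl > 0 && interval > 0 && delay >= 0);
    return interval + delay < ttl;
}
```

Under this model, a 5-second interval with a 10-second TTL tolerates a renewal that is up to 4 seconds late, while a TTL equal to the interval leaves no margin at all.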

This posts metadata to etcd and then is killed.
"""
logger.info("Target subprocess started")

Contributor

Could we please reduce the empty lines in this file?

* @var Heartbeat interval in seconds
* Interval for heartbeat that keeps remote metadata valid.
*/
std::chrono::seconds heartbeatInterval;
Contributor

I think that, to be able to review this PR, we need a detailed description that explains the why and how, as per the template.

@ovidiusm ovidiusm disabled auto-merge December 10, 2025 15:56
Contributor

@brminich brminich left a comment

Can we reuse the commWorker thread instead of creating a new one?
Or, if that is not possible, use etcd's built-in etcd::KeepAlive, which is supposed to do that internally.
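
One way to read the first suggestion is that an existing worker loop checks on each iteration whether a renewal is due, rather than a dedicated thread sleeping between renewals. The sketch below assumes that shape; `HeartbeatTimer` and `due` are hypothetical names, and the commWorker loop itself is not shown in this thread.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper that tracks when the next lease renewal is due, so an
// existing worker loop can poll it once per iteration instead of spawning a
// separate heartbeat thread.
class HeartbeatTimer {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatTimer(std::chrono::seconds interval, Clock::time_point start)
        : interval_(interval), next_due_(start + interval) {
        assert(interval.count() > 0);
    }

    // Returns true if a renewal should be sent at `now`, and schedules the
    // following renewal one interval later.
    bool due(Clock::time_point now) {
        if (now < next_due_) return false;
        next_due_ = now + interval_;
        return true;
    }

private:
    std::chrono::seconds interval_;
    Clock::time_point next_due_;
};
```

A loop would call `timer.due(HeartbeatTimer::Clock::now())` once per iteration and send the lease renewal whenever it returns true.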


5 participants