@grahamking grahamking commented Oct 27, 2025

Do not expose the etcd lease ID. Instead, there is a connection_id() -> u64 method on DistributedRuntime, which is always present. Remove the Lease object and the unused lease-related methods.

Also delete the legacy, unused DiscoveryClient.

Another step towards making etcd optional.

Summary by CodeRabbit

Release Notes

  • Breaking Changes

    • Updated worker identification mechanism: Endpoint.lease_id() method replaced with Endpoint.connection_id() returning an opaque, time-variable worker identifier. Existing code using lease-based identifiers requires updates.
  • Improvements

    • Simplified internal worker identification and event publishing infrastructure by transitioning from lease-based to connection-based identifiers across distributed components.

@grahamking grahamking requested review from a team as code owners October 27, 2025 20:21
@github-actions github-actions bot added the chore label Oct 27, 2025
@rmccorm4 rmccorm4 requested a review from kthui October 27, 2025 20:27
coderabbitai bot commented Oct 27, 2025

Walkthrough

The pull request replaces lease-based identifiers with connection-based identifiers throughout the codebase. This involves removing the Lease abstraction, updating the Endpoint API from lease_id() to connection_id(), removing the discovery module, and refactoring lease lifecycle management in the etcd transport layer.

Changes

Cohort: Python/Component Publisher Updates
File(s): components/src/dynamo/sglang/publisher.py, components/src/dynamo/trtllm/main.py, components/src/dynamo/vllm/main.py, examples/multimodal/components/worker.py
Summary: Changed worker_id source from endpoint.lease_id() to endpoint.connection_id() in KV event publisher configuration across multiple component implementations.

Cohort: Rust Bindings - Endpoint API
File(s): lib/bindings/python/rust/lib.rs, lib/bindings/python/src/dynamo/_core.pyi
Summary: Renamed method from Endpoint::lease_id() to Endpoint::connection_id() with implementation change to use self.inner.drt().connection_id(). Updated docstring from "primary lease id" to "opaque unique ID for this worker."

Cohort: KV Event Publishing
File(s): lib/bindings/python/rust/llm/kv.rs, lib/llm/src/kv_router.rs, lib/llm/src/kv_router/publisher.rs, lib/llm/src/mocker/engine.rs
Summary: Changed worker_id source from primary lease to connection_id; removed lease validation check in KV initialization; updated cancellation token source in KvRouter from primary lease token to direct drt() token.

Cohort: Runtime Lease and Discovery Removal
File(s): lib/runtime/src/component.rs, lib/runtime/src/discovery.rs, lib/runtime/src/lib.rs
Summary: Removed Lease import from component module; deleted entire discovery.rs module containing DiscoveryClient struct and lease management APIs; removed pub mod discovery export from crate root.

Cohort: Endpoint Configuration Refactoring
File(s): lib/runtime/src/component/endpoint.rs
Summary: Removed Lease field from EndpointConfig; replaced all lease_id usages with connection_id across etcd paths, health checks, discovery payloads, and cancellation token logic; updated endpoint startup to pass connection_id instead of lease_id.

Cohort: Distributed Runtime API
File(s): lib/runtime/src/distributed.rs, lib/runtime/src/storage/key_value_store/etcd.rs
Summary: Removed primary_lease() method; added connection_id() method delegating to store; removed discovery_client internal API; updated etcd PutOptions to use direct lease_id() instead of primary_lease().id().

Cohort: Etcd Transport Layer
File(s): lib/runtime/src/transports/etcd.rs
Summary: Removed public Lease type and all associated lifecycle methods (id, primary_token, child_token, revoke, is_valid); removed primary_lease(), create_lease(), and revoke_lease() public APIs; updated kv_put_with_options to use direct lease_id().

Cohort: Lease Lifecycle Management
File(s): lib/runtime/src/transports/etcd/lease.rs
Summary: Refactored create_lease() to return anyhow::Result<u64> instead of Result; updated keep_alive() to use anyhow error handling; replaced create_deadline helper with Instant-based deadline calculations; removed revoke_lease() public API.
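
As a rough illustration of the Instant-based deadline calculation mentioned for lease.rs, the sketch below derives a keep-alive deadline from a TTL using only std::time. The names next_deadline and is_expired, and the choice to renew at half the TTL, are assumptions made for this example; they are not taken from the PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: the instant by which the next keep-alive should be sent.
/// Renewing at half the TTL is an assumption for illustration, not the PR's policy.
fn next_deadline(now: Instant, ttl_secs: u64) -> Instant {
    now + Duration::from_secs(ttl_secs) / 2
}

/// True once `now` has reached or passed `deadline`.
fn is_expired(now: Instant, deadline: Instant) -> bool {
    now >= deadline
}
```

Passing `now` in explicitly, rather than calling Instant::now() inside the helper, keeps the calculation deterministic and easy to check in isolation.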

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • lib/runtime/src/transports/etcd/lease.rs — Significant logic changes to deadline handling and error propagation; transition from Lease struct return to direct u64 ID; requires careful verification of keep-alive lifecycle and cancellation semantics.
  • lib/runtime/src/component/endpoint.rs — Extensive refactoring removing Lease field and replacing lease-based logic with connection_id across multiple subsystems (etcd, health checks, discovery); impacts endpoint registration and cancellation token flow.
  • lib/runtime/src/transports/etcd.rs — Removal of Lease type (a public API surface) and consolidation of lease management; requires checking that all removal sites are correctly updated and that migration from Lease wrapper to direct u64 IDs is complete.
  • lib/bindings/python/rust/llm/kv.rs — Removed lease validation check; verify that removal of this guard does not introduce unexpected behavior during KvRouter initialization.
  • lib/runtime/src/distributed.rs — API surface change (removal of primary_lease, addition of connection_id); ensure all call sites are updated and semantics are correctly preserved.

Poem

🐰 From leases of old, we hop to the new,
Connection IDs shine, a clear worker view!
The Lease type has gone, discovery too—
Endpoint now calls connection_id() true!
Cleaner, simpler paths through the etcd maze we knew! 🎉

Pre-merge checks

❌ Failed checks (1 warning)
Check name: Description Check
Status: ⚠️ Warning
Explanation: The provided description does not follow the repository's pull request template structure. The template specifies required sections including Overview, Details, Where should the reviewer start, and Related Issues, but the author's description omits these organizational sections entirely. While the description does convey some technical details about the changes (addition of connection_id method, removal of Lease object and DiscoveryClient), it lacks the structured presentation and required contextual information specified by the template. The description also does not indicate any related issues or provide guidance on code review focus areas.
Resolution: Revise the pull request description to follow the repository template. Add an Overview section summarizing the rationale for this refactoring (moving from lease-based to connection-based identifiers), expand the Details section with more context about each change, include a "Where should the reviewer start?" section highlighting critical files, and add any Related Issues using the action keywords (e.g., "Closes #xxx").
✅ Passed checks (2 passed)
Check name: Title Check
Status: ✅ Passed
Explanation: The title "Do not expose etcd lease ID" refers to a real and significant aspect of the changeset: the removal of the Lease object and associated lease-based APIs. However, the title is incomplete as it does not mention the core replacement mechanism: the introduction of the connection_id() method that supersedes lease-based identifiers throughout the codebase. The title captures one dimension of the refactoring (removal of lease exposure) but omits the complementary addition (connection_id replacement) and the removal of DiscoveryClient, which are central to understanding the pull request's full scope.

Check name: Docstring Coverage
Status: ✅ Passed
Explanation: Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lib/runtime/src/transports/etcd.rs (2)

258-270: Apply conditional lease handling to all lease-using functions, not just lock().

The suggested refactoring is correct: when lease_id is 0 (which occurs when config.attach_lease is false), passing None instead of Some(options) is more semantically appropriate. However, the same issue affects three other functions that use identical patterns:

  • kv_create (line 117)
  • kv_create_or_validate (line 147)
  • kv_put (line 198)

For consistency and correctness, apply this conditional pattern to all four functions that handle leases, not just lock().


208-217: Conditionally attach lease only when > 0 across all affected functions (kv_create, kv_create_or_validate, kv_put, kv_put_with_options, lock).

Lease ID 0 means NoLease in etcd semantics: it represents the absence of a lease and is not a valid lease identifier. The current code unconditionally calls with_lease(0) when self.primary_lease is 0 (that is, when config.attach_lease is disabled), which is semantically incorrect.

Apply conditional checks to all five functions:

  • Line 118 (kv_create)
  • Line 148 (kv_create_or_validate)
  • Line 199 (kv_put)
  • Line 216 (kv_put_with_options)
  • Line 265 (lock)

Each should check if lease > 0 before calling with_lease(), otherwise omit the lease parameter entirely (passing None for options or not attaching the lease).
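
A minimal sketch of the guard described above, assuming a local stand-in for the etcd client's PutOptions so the snippet compiles on its own. The lease_options helper is hypothetical; the real functions would build the option inline or through whatever helper the maintainers prefer.

```rust
/// Stand-in for the etcd client's put options; hypothetical, for illustration only.
#[derive(Debug, Default, PartialEq)]
struct PutOptions {
    lease: i64,
}

impl PutOptions {
    fn new() -> Self {
        Self::default()
    }

    fn with_lease(mut self, lease: i64) -> Self {
        self.lease = lease;
        self
    }
}

/// Build options only when a real lease exists; 0 means "no lease".
fn lease_options(lease_id: u64) -> Option<PutOptions> {
    (lease_id > 0).then(|| PutOptions::new().with_lease(lease_id as i64))
}
```

With this shape, a call site can pass the returned Option straight to a put-style call that accepts optional options, so keys are written without a lease when none is attached.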

🧹 Nitpick comments (9)
lib/llm/src/mocker/engine.rs (1)

250-250: Remove commented-out debug code.

This commented line appears to be leftover from debugging or testing and should be removed to keep the codebase clean.

Apply this diff:

         let worker_id = comp.drt().connection_id();
-        // let worker_id = 0;
         tracing::debug!("Worker_id set to: {worker_id}");
lib/bindings/python/rust/lib.rs (1)

776-779: Expose a temporary backward‑compat alias for lease_id().

To reduce downstream breakage, consider adding a deprecated alias that forwards to connection_id(). Example:

 #[pymethods]
 impl Endpoint {
     // Opaque unique ID for this worker. May change over worker lifetime.
     fn connection_id(&self) -> u64 {
         self.inner.drt().connection_id()
     }
+    /// [DEPRECATED] Use connection_id()
+    fn lease_id(&self) -> u64 {
+        self.connection_id()
+    }
 }
lib/runtime/src/component/endpoint.rs (2)

64-69: Naming mismatch: etcd_path_with_lease_id now takes a connection_id.

Low risk but confusing. Consider a clearer alias to avoid “lease” wording drift.

For example, in lib/runtime/src/component.rs add:

pub fn etcd_path_with_instance_id(&self, instance_id: u64) -> String {
    self.etcd_path_with_lease_id(instance_id)
}

Then use the alias here.


198-221: Avoid attaching an etcd lease when connection_id is 0; fail fast or skip lease.

If attach_lease=false ever yields connection_id==0 while etcd_client is Some, with_lease(0) may be invalid. Guard it:

-        if let Some(etcd_client) = &etcd_client
-            && let Err(e) = etcd_client
-                .kv_create(&etcd_path, info, Some(connection_id))
+        let lease_opt = if connection_id == 0 { None } else { Some(connection_id) };
+        if let Some(etcd_client) = &etcd_client
+            && let Err(e) = etcd_client
+                .kv_create(&etcd_path, info, lease_opt)
                 .await
         {
             tracing::error!( ... );
             runtime_shutdown_token.cancel();
             return Err(error!("Unable to register service for discovery. Check discovery service status"));
         }

Optionally also add:

         let connection_id = endpoint.drt().connection_id();
+        debug_assert!(connection_id != 0, "connection_id should be non-zero when registering in etcd");
lib/runtime/src/transports/etcd.rs (1)

491-500: TODO comment can be updated to reflect connection_id semantics.

The comment still says “proper lease handling.” Consider clarifying that absence of a lease (connection_id==0) should write keys without a lease.

lib/runtime/src/transports/etcd/lease.rs (4)

9-11: Docs are stale and need a grammar fix.

Mentions returning a Lease, but API returns a u64 lease_id; also “it's” → “its”. Update for accuracy.

Apply:

-/// Create a [`Lease`] with a given time-to-live (TTL) attached to the [`CancellationToken`] and
-/// start it's keep-alive thread.
+/// Create an etcd lease with the given TTL, attach it to the provided cancellation token,
+/// spawn a keep-alive task, and return the lease id (u64).
+///
+/// Note: this function spawns a background task that maintains the lease until the token is
+/// cancelled or an unrecoverable error occurs.

81-82: Redundant binding before ?.

let _ = client.revoke(...).await?; discards the value but still propagates errors. Just await the call.

Apply:

-                let _ = client.revoke(lease_id as i64).await?;
+                client.revoke(lease_id as i64).await?;

38-42: Fix incomplete doc comment.

Stray “/// If” reads as a fragment. Tighten the section to clearly state behavior on error/cancellation.

Apply:

 /// Task to keep leases alive.
 ///
-/// If this task returns an error, the cancellation token will be invoked on the runtime.
-/// If
+/// On error, the parent cancellation token is triggered to stop the runtime.
+/// On token cancellation, the lease is revoked and the task exits Ok(()).

66-66: Tracing fields: prefer explicit formatting for clarity.

Use %lease_id or lease_id = lease_id so logs consistently render the id.

Apply, e.g.:

-                    tracing::trace!(lease_id, "keep alive response received: {:?}", resp);
+                    tracing::trace!(lease_id = %lease_id, "keep alive response received: {:?}", resp);
@@
-                tracing::trace!(lease_id, "cancellation token triggered; revoking lease");
+                tracing::trace!(lease_id = %lease_id, "cancellation token triggered; revoking lease");
@@
-                tracing::trace!(lease_id, "sending keep alive");
+                tracing::trace!(lease_id = %lease_id, "sending keep alive");

Also applies to: 80-80, 86-86

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4c8d94 and 2f767d4.

📒 Files selected for processing (18)
  • components/src/dynamo/sglang/publisher.py (1 hunks)
  • components/src/dynamo/trtllm/main.py (1 hunks)
  • components/src/dynamo/vllm/main.py (1 hunks)
  • examples/multimodal/components/worker.py (1 hunks)
  • lib/bindings/python/rust/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm/kv.rs (1 hunks)
  • lib/bindings/python/src/dynamo/_core.pyi (1 hunks)
  • lib/llm/src/kv_router.rs (1 hunks)
  • lib/llm/src/kv_router/publisher.rs (1 hunks)
  • lib/llm/src/mocker/engine.rs (1 hunks)
  • lib/runtime/src/component.rs (0 hunks)
  • lib/runtime/src/component/endpoint.rs (8 hunks)
  • lib/runtime/src/discovery.rs (0 hunks)
  • lib/runtime/src/distributed.rs (1 hunks)
  • lib/runtime/src/lib.rs (0 hunks)
  • lib/runtime/src/storage/key_value_store/etcd.rs (2 hunks)
  • lib/runtime/src/transports/etcd.rs (3 hunks)
  • lib/runtime/src/transports/etcd/lease.rs (5 hunks)
💤 Files with no reviewable changes (3)
  • lib/runtime/src/component.rs
  • lib/runtime/src/lib.rs
  • lib/runtime/src/discovery.rs
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-05T01:04:24.775Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.

Applied to files:

  • lib/llm/src/kv_router/publisher.rs
🧬 Code graph analysis (10)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
lib/runtime/src/transports/etcd.rs (2)
  • new (61-104)
  • new (476-519)
lib/bindings/python/src/dynamo/_core.pyi (3)
lib/bindings/python/rust/lib.rs (1)
  • connection_id (777-779)
lib/runtime/src/distributed.rs (1)
  • connection_id (213-215)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (54-56)
lib/runtime/src/component/endpoint.rs (8)
lib/bindings/python/rust/lib.rs (2)
  • connection_id (777-779)
  • endpoint (678-684)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • connection_id (160-164)
  • endpoint (117-121)
lib/runtime/src/distributed.rs (1)
  • connection_id (213-215)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (54-56)
lib/runtime/src/storage/key_value_store.rs (3)
  • connection_id (114-114)
  • connection_id (161-168)
  • connection_id (210-212)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • connection_id (107-109)
lib/runtime/src/storage/key_value_store/nats.rs (1)
  • connection_id (52-54)
lib/runtime/src/component.rs (6)
  • endpoint (270-278)
  • etcd_path_with_lease_id (542-544)
  • subject (577-579)
  • etcd_path (253-256)
  • etcd_path (532-539)
  • etcd_path (682-684)
components/src/dynamo/trtllm/main.py (4)
lib/bindings/python/rust/lib.rs (2)
  • endpoint (678-684)
  • connection_id (777-779)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • endpoint (117-121)
  • connection_id (160-164)
lib/runtime/src/distributed.rs (1)
  • connection_id (213-215)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (54-56)
lib/bindings/python/rust/lib.rs (6)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • connection_id (160-164)
lib/runtime/src/distributed.rs (1)
  • connection_id (213-215)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (54-56)
lib/runtime/src/storage/key_value_store.rs (3)
  • connection_id (114-114)
  • connection_id (161-168)
  • connection_id (210-212)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • connection_id (107-109)
lib/runtime/src/storage/key_value_store/nats.rs (1)
  • connection_id (52-54)
lib/runtime/src/distributed.rs (6)
lib/bindings/python/rust/lib.rs (1)
  • connection_id (777-779)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • connection_id (160-164)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (54-56)
lib/runtime/src/storage/key_value_store.rs (3)
  • connection_id (114-114)
  • connection_id (161-168)
  • connection_id (210-212)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • connection_id (107-109)
lib/runtime/src/storage/key_value_store/nats.rs (1)
  • connection_id (52-54)
components/src/dynamo/vllm/main.py (4)
lib/bindings/python/rust/lib.rs (1)
  • connection_id (777-779)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • connection_id (160-164)
lib/runtime/src/distributed.rs (1)
  • connection_id (213-215)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (54-56)
components/src/dynamo/sglang/publisher.py (4)
lib/bindings/python/rust/lib.rs (1)
  • connection_id (777-779)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • connection_id (160-164)
lib/runtime/src/distributed.rs (1)
  • connection_id (213-215)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (54-56)
examples/multimodal/components/worker.py (4)
lib/bindings/python/rust/lib.rs (2)
  • endpoint (678-684)
  • connection_id (777-779)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • endpoint (117-121)
  • connection_id (160-164)
lib/runtime/src/distributed.rs (1)
  • connection_id (213-215)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (54-56)
lib/runtime/src/transports/etcd.rs (1)
lib/runtime/src/transports/etcd/lease.rs (1)
  • create_lease (11-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (15)
lib/llm/src/mocker/engine.rs (1)

249-249: LGTM! Correct migration to connection-based identifier.

The change from lease-based to connection-based worker identification aligns with the PR objectives and correctly uses the new connection_id() API on DistributedRuntime.

lib/llm/src/kv_router/publisher.rs (1)

787-787: LGTM! Clean migration to connection-based identifier.

The change from lease-based to connection-based worker ID is consistent with the PR's objectives and maintains the same functionality.

lib/llm/src/kv_router.rs (1)

227-227: LGTM! Simplified token acquisition.

Removing the intermediate primary_lease() call streamlines the code while maintaining the same functionality.

components/src/dynamo/vllm/main.py (1)

138-138: LGTM! Consistent API migration.

The change from lease_id() to connection_id() aligns with the new Endpoint API and maintains the same worker identification functionality.

components/src/dynamo/trtllm/main.py (1)

424-424: LGTM! Consistent with API changes.

The migration to connection_id() matches the pattern applied across other components.

examples/multimodal/components/worker.py (1)

166-166: LGTM! Consistent API update.

The change aligns with the connection-based identifier migration applied throughout the codebase.

lib/runtime/src/storage/key_value_store/etcd.rs (1)

177-177: LGTM! Simplified lease ID access for etcd operations.

The change from primary_lease().id() to lease_id() removes an intermediate abstraction while maintaining the same functionality for associating etcd keys with leases.

Also applies to: 246-246

lib/bindings/python/rust/llm/kv.rs (1)

994-1002: LGTM! Removal of lease validation aligns with API changes.

The removed check for primary_lease() existence is consistent with removing the Lease abstraction from the public API. Since leases still exist internally, any actual lease-related issues will be caught downstream in the KvRouter creation flow.

lib/runtime/src/distributed.rs (1)

213-215: LGTM! Well-designed public API for connection identification.

The connection_id() method provides a clean abstraction by delegating to the underlying store, which is appropriate since the connection ID is tied to the storage backend (etcd lease, NATS client, or memory).
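
The delegation described here can be sketched roughly as follows. The trait and both structs are simplified, hypothetical stand-ins that reuse names from the PR for readability; the real types carry far more state, and the actual signatures may differ.

```rust
/// Hypothetical trait standing in for the runtime's key-value store abstraction.
trait KeyValueStore {
    /// Opaque ID for this connection; may change over the worker's lifetime.
    fn connection_id(&self) -> u64;
}

/// Stand-in for an in-memory backend with a fixed ID.
struct MemoryStore {
    id: u64,
}

impl KeyValueStore for MemoryStore {
    fn connection_id(&self) -> u64 {
        self.id
    }
}

/// Simplified runtime that owns a store and forwards the call to it.
struct DistributedRuntime {
    store: Box<dyn KeyValueStore>,
}

impl DistributedRuntime {
    fn connection_id(&self) -> u64 {
        self.store.connection_id()
    }
}
```

Because callers only see a u64 from the runtime, the backend behind the trait can be swapped without touching call sites, which fits the stated goal of making etcd optional.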

components/src/dynamo/sglang/publisher.py (1)

126-131: Good switch to connection_id; align Python stub to sync.

This aligns with the runtime’s new identifier. However, lib/bindings/python/src/dynamo/_core.pyi declares Endpoint.connection_id() as async, while the runtime method is synchronous. Please update the stub to a synchronous def to avoid type-checker confusion. Based on relevant snippets.

lib/runtime/src/component/endpoint.rs (1)

156-161: LGTM on cancellation semantics simplification.

Using only runtime_shutdown_token is fine since lease keepalive now cancels the primary token on failure, propagating to this child.

lib/runtime/src/transports/etcd.rs (3)

141-165: Verify claim that conditional lease handling already exists elsewhere.

The review states "Same conditional lease handling" but after exhaustive searching:

  • kv_create (line 116) uses unconditional: PutOptions::new().with_lease(id as i64)
  • kv_put (line 192) uses unconditional: PutOptions::new().with_lease(id as i64)
  • kv_put_with_options (line 208) uses unconditional: .with_lease(self.lease_id() as i64)
  • No method across etcd.rs, key_value_store/etcd.rs, or lock.rs applies the proposed conditional .then(||) pattern

The suggested conditional pattern (id > 0).then(|| PutOptions::new().with_lease(id as i64)) appears to be a new refactoring, not an existing pattern elsewhere. If this change is desirable, the same logic should likely apply to kv_create and kv_put for consistency. Verify intent: Is this a bug fix or aspirational refactoring that should span multiple methods?


192-206: Address the inconsistency in conditional lease handling across related methods.

The suggested change for kv_put is technically correct—the put() method accepts Option<PutOptions> directly, so (id > 0).then(|| PutOptions::new().with_lease(id as i64)) is valid and clearer than always creating PutOptions with a zero lease. However, three other methods use the identical current pattern:

  • kv_create (lines 116–118): unconditional PutOptions
  • kv_create_or_validate (lines 147–148): unconditional PutOptions
  • lock() (line 264): unconditional LockOptions

For consistency, verify whether all similar methods should also apply conditional lease handling when id > 0, or confirm that kv_put alone requires this change. If updating all, prioritize kv_create and kv_create_or_validate first since they follow the same call pattern.


116-126: Apply lease guard consistently across all kv operations; current fix incomplete.

The proposed fix correctly changes put_options from PutOptions to Option<PutOptions> using (id > 0).then(...), matching the TxnOp::put API signature. However, the same pattern (unconditionally applying with_lease(id as i64) where id could be 0) appears in at least two other methods:

  • kv_create_or_validate (line 148)
  • kv_put (line 199)

Apply the lease guard consistently across all three methods for uniformity. Additionally, verify whether self.lease_id() can legitimately return 0 in your initialization logic to ensure this guard is necessary.

lib/runtime/src/transports/etcd/lease.rs (1)

11-16: Based on the verification through code inspection and web search on CancellationToken semantics, the review comment is valid and addresses a real architectural issue. The standard pattern for using CancellationToken in multiple contexts is to clone the token and pass clones to different tasks. The current create_lease function consumes the token, preventing the caller from using it elsewhere.

However, examining the actual codebase:

  • The function does already create a child token (line 20: let child = token.child_token())
  • The spawned task uses this child token for cancellation monitoring
  • But the original token is still moved into the task and called with .cancel() on error (line 30)

The design issue is real: if the keep-alive task encounters an error, it cancels the original parent token, which will cascade cancellation to all other operations sharing that token. The proposed refactoring improves this by:

  1. Taking a reference instead (non-breaking for callers already cloning)
  2. Creating a clone internally for the error-handling task
  3. Allowing the parent token to remain under the caller's control

The review comment is technically sound and addresses a valid design concern. The proposed fix is appropriate for the stated use case.

Token ownership: create_lease currently consumes the caller's CancellationToken.

The function takes the token by value and moves it into a spawned async task, preventing reuse by the caller. If the keep-alive task encounters an error, it calls token.cancel(), which cascades to all operations using that token.

Change the signature to take &CancellationToken and clone internally within the spawned task:

 pub async fn create_lease(
     mut lease_client: LeaseClient,
     ttl: u64,
-    token: CancellationToken,
+    token: &CancellationToken,
 ) -> anyhow::Result<u64> {
     let lease = lease_client.grant(ttl as i64, None).await?;
     let id = lease.id() as u64;
     let ttl = lease.ttl() as u64;
     let child = token.child_token();
+    let parent = token.clone();
 
     tokio::spawn(async move {
         match keep_alive(lease_client, id, ttl, child).await {
             Ok(_) => tracing::trace!("keep alive task exited successfully"),
             Err(e) => {
                 tracing::error!(error = %e, "Unable to maintain lease. Check etcd server status");
-                token.cancel();
+                parent.cancel();
             }
         }
     });
 
     Ok(id)
 }

This preserves the caller's token for broader shutdown orchestration while still allowing the keep-alive task to signal its own failure.
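
The ownership pattern in this suggestion can be tried without tokio using a std-only analogue. CancelToken below is a hypothetical stand-in built on an atomic flag, not the real CancellationToken, and spawn_keep_alive with its keep_alive_ok parameter is invented for the example. As in the suggested diff, cancelling the clone still cancels the shared state; what changes is that the caller keeps its own handle.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Minimal stand-in for a cancellation token; clones share one flag.
#[derive(Clone, Default)]
struct CancelToken {
    cancelled: Arc<AtomicBool>,
}

impl CancelToken {
    fn cancel(&self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::SeqCst)
    }
}

/// Borrow the token and clone it for the background task, so the caller keeps its handle.
/// `keep_alive_ok` is a hypothetical stand-in for the outcome of the keep-alive loop.
fn spawn_keep_alive(token: &CancelToken, keep_alive_ok: bool) -> thread::JoinHandle<()> {
    let parent = token.clone();
    thread::spawn(move || {
        if !keep_alive_ok {
            // Signal failure through the shared flag.
            parent.cancel();
        }
    })
}
```

After the call returns, the caller can still inspect or cancel its token, and a failure in the background task remains visible through is_cancelled().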

Instead there is a `connection_id() -> u64` method on
`DistributedRuntime`, which is always present. Remove the `Lease` object
and the unused lease related methods.

Also delete legacy unused `DiscoveryClient`.

Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
Lost in rebase

Signed-off-by: Graham King <[email protected]>
@grahamking grahamking enabled auto-merge (squash) October 28, 2025 16:07
@grahamking grahamking merged commit c78b590 into main Oct 28, 2025
29 of 31 checks passed
@grahamking grahamking deleted the gk-etcd-p5 branch October 28, 2025 16:20
3 participants