
Fix NodeStateManager lost update problem #25069

Open
wants to merge 2 commits into master

Conversation

@brybacki (Contributor) commented Feb 18, 2025

Description

The NodeStateManager's drain implementation suffered from a lost update problem. A rapid sequence of drain/active/drain requests could cause the worker to prematurely exit a new draining state, ignoring in-flight tasks, because it failed to distinguish between old and new drain operations.

This was resolved by adding versioning to the node state, ensuring each state change request is uniquely identified and processed to completion. Each time the worker state changes, the version number increases, so even if the state flips quickly from (DRAINING, 1) to (ACTIVE, 2) and back to (DRAINING, 3), the handling code can tell which versioned state it was invoked for. As a result, it only completes a drain action when it is still working on the latest version of the request.
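
To illustrate the idea, here is a minimal sketch of a versioned state with a version-aware completion check. The enum values, class, and method names below are simplified assumptions for illustration, not the actual NodeStateManager code.

import java.util.concurrent.atomic.AtomicReference;

// Sketch only: each transition produces a new (state, version) pair, and an
// asynchronous drain handler remembers the version it was started for.
enum NodeState { ACTIVE, DRAINING, DRAINED }

record VersionedState(NodeState state, long version) {}

class VersionedStateSketch
{
    private final AtomicReference<VersionedState> current =
            new AtomicReference<>(new VersionedState(NodeState.ACTIVE, 0));

    // Every transition bumps the version, even when the same state is requested again.
    VersionedState transitionTo(NodeState newState)
    {
        return current.updateAndGet(old -> new VersionedState(newState, old.version() + 1));
    }

    // A drain that started at version v may only complete if no newer transition happened.
    void drainingComplete(long versionDrainStartedAt)
    {
        VersionedState now = current.get();
        if (now.state() == NodeState.DRAINING && now.version() == versionDrainStartedAt) {
            // compareAndSet so a concurrent transition cannot be clobbered by a stale drain
            current.compareAndSet(now, new VersionedState(NodeState.DRAINED, now.version() + 1));
        }
        // otherwise a newer request superseded this drain; do nothing
    }
}

In this sketch, the task draining and grace-period sleep described above would happen between transitionTo(DRAINING) and the corresponding drainingComplete(...) call.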

Additional context and related issues

Example:

  1. NodeStateManager is in the DRAINING state, handling drain -> waitActiveTasksToFinish.
  2. It transitions to ACTIVE - this finishes waitTasksToFinish(activeTasks); and moves on to sleepUninterruptibly(gracePeriod.toMillis(), MILLISECONDS);
  3. It receives some new tasks.
  4. It transitions to DRAINING again, so a new drain is started.
  5. The original drain finishes sleepUninterruptibly(gracePeriod.toMillis(), MILLISECONDS); and proceeds to drainingComplete() - it simply starts the transition to DRAINED, even though it should not.

The intermediate ACTIVE state was not observed by all of the checks along this path. The worker is now in a new DRAINING state, with new tasks to drain, and it should not continue executing the previous drain's completion logic. A sketch contrasting the two completion checks is shown below.
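
To make the failure at step 5 concrete, here is a hedged before/after sketch of the completion check, reusing the NodeState and VersionedState types from the sketch above; the method names are illustrative only.

class DrainCompletionChecks
{
    // Buggy shape: at step 5 the state is DRAINING again (the *new* drain), so a
    // state-only check passes and the stale drain wrongly proceeds to DRAINED.
    static boolean mayCompleteStateOnly(NodeState observed)
    {
        return observed == NodeState.DRAINING;
    }

    // Fixed shape: the stale drain started at version 1, but the current state is
    // (DRAINING, 3), so the version comparison fails and the stale drain backs off.
    static boolean mayCompleteVersioned(VersionedState observed, long drainStartedAtVersion)
    {
        return observed.state() == NodeState.DRAINING
                && observed.version() == drainStartedAtVersion;
    }
}

With the version comparison, the drain started at (DRAINING, 1) cannot complete once the state has moved on to (ACTIVE, 2) and then (DRAINING, 3).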

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

Changing "NodeState" to "Worker State change" in the log messages clarifies that the logs refer to the worker's state transitions and makes them consistent with the messages in ServerInfoResource. Consistent logs are easier to grep.

private record VersionedState(NodeState state, long version)
{
    private static final AtomicLong versionProvider = new AtomicLong(0);
    // ...
}
Member: put it in NodeStateManager and make non-static

Member: Also, it does not need to be atomic as it is accessed from synchronized blocks.
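
For context, a sketch of the shape these two suggestions point toward: a plain, non-static counter owned by the manager and touched only inside synchronized blocks. The class and method names are illustrative assumptions, not the actual NodeStateManager implementation, and the NodeState enum is reused from the first sketch above.

class NodeStateManagerSketch
{
    private record VersionedState(NodeState state, long version) {}

    // A plain long is enough here: every read and write happens while holding
    // the same monitor, so no extra atomicity is needed.
    private long versionCounter;
    private VersionedState currentState = new VersionedState(NodeState.ACTIVE, 0);

    synchronized VersionedState transitionTo(NodeState newState)
    {
        versionCounter++;
        currentState = new VersionedState(newState, versionCounter);
        return currentState;
    }

    synchronized boolean isStillCurrent(long observedVersion)
    {
        return currentState.version() == observedVersion;
    }
}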
