Skip to content

Conversation

zmc
Copy link
Member

@zmc zmc commented Mar 5, 2025

This fixes a bug where the supervisor would avoid locking its own nodes.

Example from job 8142206:

2025-03-04T22:01:06.922 ERROR:teuthology.lock.ops:Refusing to unlock smithi087 since it has an active job: skanta-2025-02-20_02:03:15-rados-wip-bharath1-testing-2025-02-19-0904-distro-default-smithi/8142206

Note that the job IDs are the same.

zmc added 4 commits March 4, 2025 18:15
Paddles uses this format.

Signed-off-by: Zack Cerza <[email protected]>
This was simply faulty logic. One effect was that the supervisor would refuse to
unlock nodes that were used by its own job.

Signed-off-by: Zack Cerza <[email protected]>
The supervisor knows when its job's nodes are no longer needed.

Signed-off-by: Zack Cerza <[email protected]>
@zmc zmc requested a review from a team as a code owner March 5, 2025 01:18
@zmc zmc requested review from jmundack and VallariAg and removed request for a team March 5, 2025 01:18
if job_config.get("unlock_on_failure", True):
log.info('Unlocking machines...')
lock_ops.unlock_safe(locked, job_config["owner"], job_config["name"], job_config["job_id"])
lock_ops.unlock_many(locked, job_config["owner"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we switch to unlock_many? It doesn't seem to have any checks before unlocking the nodes.

@jmundack jmundack requested review from a team, VallariAg and amathuria and removed request for jmundack and a team March 7, 2025 19:17
@jmundack
Copy link
Contributor

@amathuria - mind taking a look when you have a few mins?

@zmc zmc marked this pull request as draft March 18, 2025 23:09
@zmc
Copy link
Member Author

zmc commented Mar 18, 2025

I discovered a potential issue with jobs that run too long and get killed; I'll update once I've gotten to the bottom of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants