
worker: run repo maintenance during idle time (Bug 2037216) #1135

Open
cgsheeh wants to merge 10 commits into mozilla-conduit:main from cgsheeh:idle-strip

cgsheeh (Member) commented May 5, 2026

Lando's workers typically run repo-cleaning commands at
the start of each job. In hg workers, we run `hg strip`
at the start of each job to remove previously-created
stale commits, even though those commits do not interfere
with job completion. In Git, we take the opposite approach
and simply ignore the temporary work branch, which is
never cleaned up.

Add a new "repo maintenance" step that runs while the
worker is being throttled because no jobs remain in
the queue. To avoid running excessive maintenance, the
time of the last maintenance run for each repo is
recorded in the worker, and maintenance is skipped if
it was completed within the threshold.
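
A minimal sketch of the skip-if-recent logic described above, assuming the worker keeps a per-repo timestamp in memory. `MaintenanceScheduler`, `threshold_seconds`, and `run_if_due` are hypothetical names for illustration, not the identifiers used in this PR:

```python
import time
from typing import Callable


class MaintenanceScheduler:
    """Hypothetical helper: tracks when each repo was last maintained."""

    def __init__(
        self,
        threshold_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.threshold_seconds = threshold_seconds
        self.clock = clock
        # Maps a repo name to the clock reading of its last completed run.
        self.last_run: dict[str, float] = {}

    def is_due(self, repo_name: str) -> bool:
        """Return True if the repo was never maintained or the threshold has passed."""
        last = self.last_run.get(repo_name)
        return last is None or self.clock() - last >= self.threshold_seconds

    def run_if_due(self, repo_name: str, maintenance: Callable[[], None]) -> bool:
        """Run `maintenance` unless it already completed within the threshold."""
        if not self.is_due(repo_name):
            return False
        maintenance()
        # Record the time only after the run completes successfully.
        self.last_run[repo_name] = self.clock()
        return True
```

Injecting the clock keeps the sketch testable without sleeping; a worker would presumably call something like `run_if_due` only from its idle/throttle path.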

For Mercurial, move the `hg strip` command into this maintenance
task, which should save us about 8s for each push to try.
For Git, add a cleanup of the stale working branches,
so we no longer have thousands of temp branches in our
worker repos.

After this change, every `HgSCM.clean_repo` call site
passes `strip_non_public_commits=False`, while every
`GitSCM.clean_repo` call site passes `True`.
Remove the kwarg and make each behaviour the default.

cgsheeh requested a review from a team as a code owner May 5, 2026 19:47

github-actions (Bot) commented May 5, 2026

View this pull request in Lando to land it once approved.

zzzeid (Contributor) left a comment

Few small comments (and note failing tests).

Comment thread src/lando/main/scm/git.py
Comment on lines +493 to +513
    """Delete leftover `lando-<timestamp>` work branches.

    Each landing creates a fresh work branch in `update_repo`, and they
    accumulate on disk indefinitely. Idle-time cleanup keeps the local
    branch list small without affecting per-job latency.
    """
    branches = self._git_run(
        "for-each-ref",
        "--format=%(refname:short)",
        "refs/heads/lando-*",
        cwd=self.path,
    ).splitlines()
    if not branches:
        return

    # `git branch -D` refuses to delete the currently checked-out branch,
    # so move off any `lando-*` branch first.
    if self.get_current_branch().startswith("lando-"):
        self._git_run("checkout", "--force", self.default_branch, cwd=self.path)

    self._git_run("branch", "-D", *branches, cwd=self.path)

Contributor

This should go in its own method that is called within maintenance, so that we can add other maintenance methods as needed, if needed.

Member Author

I'd rather hold off on this until we have a second maintenance task that justifies the split. Extracting the body into a helper would leave maintenance() as a one-line wrapper, which isn't really useful. We're assuming we will know what the next maintenance step might look like, and if we're wrong it just creates more work for the next person who adds it.

Contributor

Could hold off, but at that point I would suggest either modifying the docstring (to be more centered around what the maintenance concept is about, and adding more detail in the summary), or renaming the method to delete_job_branches.

> would leave maintenance() as a one-line wrapper

We do that elsewhere, nothing wrong with that (see run_code_formatters).
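
For reference, the split being discussed in this thread could look roughly like the sketch below. `GitSCMSketch` is an illustrative stand-in rather than the real `GitSCM`, and the helper body is elided; only the names `maintenance` and `delete_job_branches` come from the comments above:

```python
class GitSCMSketch:
    """Illustrative stand-in for the SCM class; not the real `GitSCM`."""

    def __init__(self):
        # Records which tasks ran, purely so the sketch is observable.
        self.calls = []

    def delete_job_branches(self) -> None:
        """Delete leftover `lando-*` work branches (body elided in this sketch)."""
        self.calls.append("delete_job_branches")

    def maintenance(self) -> None:
        """Run all idle-time maintenance tasks for this repo."""
        # Further maintenance tasks could be appended here as they are needed.
        self.delete_job_branches()
```

With this shape `maintenance()` is indeed a one-line wrapper today, which is the trade-off both sides of the thread are weighing.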

Comment thread src/lando/main/scm/hg.py
    except HgException:
        pass
    finally:
        self.hg_repo.close()

Contributor

Same comment here, this should go in a separate method that is called within maintenance.

Member Author

See my comment on the Git version of maintenance.

Contributor

See my reply as well in that thread.

cgsheeh requested review from shtrom and zzzeid May 6, 2026 03:10

    with pytest.raises(HgCommandError, match="no changes found"):
        repo.run_hg_cmds([["outgoing"]])
    assert not repo.run_hg_cmds([["status"]])
    with repo.for_pull(), hg_clone.as_cwd():

Member

What is the purpose of the hg_clone.as_cwd() here?

    @@ -31,7 +31,7 @@ def test_integrated_hgrepo_clean_repo(hg_clone):
        repo = HgSCM(hg_clone.strpath)

Member

nit: Could we rename this variable to scm while we're at it?

Member Author

repo = HgSCM() is the convention in this file, oddly enough. I updated this instance, but we should fix the others in a follow-up. :)

Member

It pre-dates the SCM split (;



    @pytest.fixture
    def mocked_enabled_repos(hg_landing_worker):

Member

Could we use the git_landing_worker instead? This may exist for longer than the hg one.

Member Author

I tried switching it over but hit test failures, filed https://bugzilla.mozilla.org/show_bug.cgi?id=2037553 to track the underlying issue.

Comment thread src/lando/api/tests/test_worker.py
    assert len(mocked_enabled_repos) >= 2, "Test requires at least two enabled repos."

    failing_repo, *healthy_repos = mocked_enabled_repos
    failing_repo._scm.maintenance.side_effect = SCMException("boom", "", "")

Member

💥
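
The excerpt above suggests the test checks that one repo's maintenance failure does not stop the others from being maintained. A self-contained sketch of that pattern follows; `run_maintenance` is a hypothetical loop, and the `SCMException` defined here is a local stand-in whose three positional arguments merely mirror the call in the excerpt:

```python
import logging
from unittest.mock import Mock

logger = logging.getLogger(__name__)


class SCMException(Exception):
    """Local stand-in; the (msg, out, err) arguments mirror the test excerpt."""

    def __init__(self, msg, out, err):
        super().__init__(msg)
        self.out = out
        self.err = err


def run_maintenance(repos) -> list:
    """Hypothetical loop: maintain every repo, continuing past failures."""
    failed = []
    for repo in repos:
        try:
            repo._scm.maintenance()
        except SCMException:
            # Log and move on so one broken repo cannot block the rest.
            logger.exception("Maintenance failed for %s", repo)
            failed.append(repo)
    return failed


failing_repo, healthy_repo = Mock(), Mock()
failing_repo._scm.maintenance.side_effect = SCMException("boom", "", "")
failed = run_maintenance([failing_repo, healthy_repo])
```

Setting `side_effect` to an exception instance makes the mock raise it when called, which is how the failing repo is simulated here.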

zzzeid (Contributor) left a comment

Few additional comments but looks good otherwise.

Comment thread src/lando/main/scm/git.py
Comment on lines +508 to +509
    # `git branch -D` refuses to delete the currently checked-out branch,
    # so move off any `lando-*` branch first.

Contributor

Interesting bit here. In the future it might make more sense to ensure we are back on the default branch after a job is finished.
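
One way the "back on the default branch after a job" idea could be sketched is with a context manager that restores the branch in a `finally` block. Everything below is hypothetical: `FakeGitSCM` is an in-memory stand-in and `job_branch` is not an existing Lando helper:

```python
from contextlib import contextmanager


class FakeGitSCM:
    """Minimal in-memory stand-in; the real class shells out to git."""

    default_branch = "main"

    def __init__(self):
        self.current_branch = self.default_branch

    def checkout(self, branch: str) -> None:
        self.current_branch = branch


@contextmanager
def job_branch(scm, name: str):
    """Work on branch `name`, then always return to the default branch."""
    scm.checkout(name)
    try:
        yield scm
    finally:
        # Runs whether the job succeeded or raised.
        scm.checkout(scm.default_branch)
```

If jobs always ended this way, idle-time cleanup would no longer need to move off a `lando-*` branch before deleting it.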

Comment thread src/lando/main/scm/git.py

    # `git branch -D` refuses to delete the currently checked-out branch,
    # so move off any `lando-*` branch first.
    if self.get_current_branch().startswith("lando-"):

Contributor

nit: with our current setup, this is a little redundant. Seems that deterministically, this if statement will always be True with the exception of perhaps first start.

nit: self.get_current_branch().startswith("lando-") (or even "lando-") should probably be in their own property/method/variable. E.g., self.is_on_job_branch or something similar.
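
A rough sketch of the second nit, assuming the prefix is hoisted into a class constant. `BranchAwareSCM` and `JOB_BRANCH_PREFIX` are illustrative names; only `get_current_branch` and `is_on_job_branch` appear in the discussion:

```python
class BranchAwareSCM:
    """Illustrative stand-in showing the suggested property; not the real `GitSCM`."""

    # Prefix shared by all per-job work branches (hypothetical constant name).
    JOB_BRANCH_PREFIX = "lando-"

    def __init__(self, current_branch: str):
        self._current_branch = current_branch

    def get_current_branch(self) -> str:
        # The real method would query git; this sketch returns a stored value.
        return self._current_branch

    @property
    def is_on_job_branch(self) -> bool:
        """Whether the checked-out branch is a per-job work branch."""
        return self.get_current_branch().startswith(self.JOB_BRANCH_PREFIX)
```

The call site would then read `if self.is_on_job_branch:`, and the prefix would be defined in one place.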
