worker: run repo maintenance during idle time (Bug 2037216)#1135
worker: run repo maintenance during idle time (Bug 2037216)#1135cgsheeh wants to merge 10 commits intomozilla-conduit:mainfrom
Conversation
Lando's workers typically run repo-cleaning commands at the beginning of each job processing time. In hg workers, we run `hg strip` at the start of each job to remove previously-created stale commits, despite those commits not interfering with the job completion. In Git, we take the opposite approach and simply ignore the temporary work branch - it is never cleaned up. Add a new "repo maintenance" step that runs while the worker is being throttled due to no jobs remaining in the queue. To avoid running excessive maintenance, the runtime of the last maintenance run for each repo is recorded in the worker, and maintenance is skipped if it has been completed within the threshold. For Mercurial, move the `hg strip` command into this maintenance task, which should save us about 8s for each push to try. For Git, add a cleanup of the stale working branches, so we no longer have thousands of temp branches in our worker repos. After this change, each `HgSCM.clean_repo` call sites always pass `strip_non_public_commits=False`, while `GitSCM.clean_repo` call sites always pass `True`. Remove the kwarg and make each behaviour the default.
|
View this pull request in Lando to land it once approved. |
| """Delete leftover `lando-<timestamp>` work branches. | ||
|
|
||
| Each landing creates a fresh work branch in `update_repo`, and they | ||
| accumulate on disk indefinitely. Idle-time cleanup keeps the local | ||
| branch list small without affecting per-job latency. | ||
| """ | ||
| branches = self._git_run( | ||
| "for-each-ref", | ||
| "--format=%(refname:short)", | ||
| "refs/heads/lando-*", | ||
| cwd=self.path, | ||
| ).splitlines() | ||
| if not branches: | ||
| return | ||
|
|
||
| # `git branch -D` refuses to delete the currently checked-out branch, | ||
| # so move off any `lando-*` branch first. | ||
| if self.get_current_branch().startswith("lando-"): | ||
| self._git_run("checkout", "--force", self.default_branch, cwd=self.path) | ||
|
|
||
| self._git_run("branch", "-D", *branches, cwd=self.path) |
There was a problem hiding this comment.
This should go in its own method that is called withinmaintenance, so that we can add other maintenance methods as needed, if needed.
There was a problem hiding this comment.
I'd rather hold off on this until we have a second maintenance task that justifies the split. Extracting the body into a helper would leave maintenance() as a one-line wrapper, which isn't really useful. We're assuming we will know what the next maintenance step might look like, and if we're wrong it just creates more work for the next person who adds it.
There was a problem hiding this comment.
Could hold off, but at that point I would suggest either modifying the docstring (to be more centered around what the maintenance concept is about, and adding more detail in the summary), or renaming the method to delete_job_branches.
would leave maintenance() as a one-line wrapper
We do that elsewhere, nothing wrong with that (see run_code_formatters).
| except HgException: | ||
| pass | ||
| finally: | ||
| self.hg_repo.close() |
There was a problem hiding this comment.
Same comment here, this should go in a separate method that is called within maintenance.
There was a problem hiding this comment.
See my comment on the Git version of maintenance.
There was a problem hiding this comment.
See my reply as well in that thread.
| with pytest.raises(HgCommandError, match="no changes found"): | ||
| repo.run_hg_cmds([["outgoing"]]) | ||
| assert not repo.run_hg_cmds([["status"]]) | ||
| with repo.for_pull(), hg_clone.as_cwd(): |
There was a problem hiding this comment.
What is the purpose of the hg_clone.as_cwd() here?
| @@ -31,7 +31,7 @@ def test_integrated_hgrepo_clean_repo(hg_clone): | |||
| repo = HgSCM(hg_clone.strpath) | |||
There was a problem hiding this comment.
nit: Could we rename this variable to scm while we're at it?
There was a problem hiding this comment.
repo = HgSCM() is the convention in this file, oddly enough. I updated this instance, but we should fix the others in a follow-up. :)
|
|
||
|
|
||
| @pytest.fixture | ||
| def mocked_enabled_repos(hg_landing_worker): |
There was a problem hiding this comment.
Could we use the git_landing_worker instead? This may exist for longer than the hg one.
There was a problem hiding this comment.
I tried switching it over but hit test failures, filed https://bugzilla.mozilla.org/show_bug.cgi?id=2037553 to track the underlying issue.
| assert len(mocked_enabled_repos) >= 2, "Test requires at least two enabled repos." | ||
|
|
||
| failing_repo, *healthy_repos = mocked_enabled_repos | ||
| failing_repo._scm.maintenance.side_effect = SCMException("boom", "", "") |
zzzeid
left a comment
There was a problem hiding this comment.
Few additional comments but looks good otherwise.
| # `git branch -D` refuses to delete the currently checked-out branch, | ||
| # so move off any `lando-*` branch first. |
There was a problem hiding this comment.
Interesting bit here. In the future it might make more sense to ensure we are back on the default branch after a job is finished.
|
|
||
| # `git branch -D` refuses to delete the currently checked-out branch, | ||
| # so move off any `lando-*` branch first. | ||
| if self.get_current_branch().startswith("lando-"): |
There was a problem hiding this comment.
nit: with our current setup, this is a little redundant. Seems that deterministically, this if statement will always be True with the exception of perhaps first start.
nit: self.get_current_branch().startswith("lando-") (or even "lando-") should probably be in their own property/method/variable. E.g., self.is_on_job_branch or something similar.
| """Delete leftover `lando-<timestamp>` work branches. | ||
|
|
||
| Each landing creates a fresh work branch in `update_repo`, and they | ||
| accumulate on disk indefinitely. Idle-time cleanup keeps the local | ||
| branch list small without affecting per-job latency. | ||
| """ | ||
| branches = self._git_run( | ||
| "for-each-ref", | ||
| "--format=%(refname:short)", | ||
| "refs/heads/lando-*", | ||
| cwd=self.path, | ||
| ).splitlines() | ||
| if not branches: | ||
| return | ||
|
|
||
| # `git branch -D` refuses to delete the currently checked-out branch, | ||
| # so move off any `lando-*` branch first. | ||
| if self.get_current_branch().startswith("lando-"): | ||
| self._git_run("checkout", "--force", self.default_branch, cwd=self.path) | ||
|
|
||
| self._git_run("branch", "-D", *branches, cwd=self.path) |
There was a problem hiding this comment.
Could hold off, but at that point I would suggest either modifying the docstring (to be more centered around what the maintenance concept is about, and adding more detail in the summary), or renaming the method to delete_job_branches.
would leave maintenance() as a one-line wrapper
We do that elsewhere, nothing wrong with that (see run_code_formatters).
| except HgException: | ||
| pass | ||
| finally: | ||
| self.hg_repo.close() |
There was a problem hiding this comment.
See my reply as well in that thread.
Lando's workers typically run repo-cleaning commands at
the beginning of each job processing time. In hg workers,
we run
hg stripat the start of each job to removepreviously-created stale commits, despite those commits
not interfering with the job completion. In Git, we take
the opposite approach and simply ignore the temporary
work branch - it is never cleaned up.
Add a new "repo maintenance" step that runs while the
worker is being throttled due to no jobs remaining in
the queue. To avoid running excessive maintenance, the
runtime of the last maintenance run for each repo is
recorded in the worker, and maintenance is skipped if
it has been completed within the threshold.
For Mercurial, move the
hg stripcommand into this maintenancetask, which should save us about 8s for each push to try.
For Git, add a cleanup of the stale working branches,
so we no longer have thousands of temp branches in our
worker repos.
After this change, each
HgSCM.clean_repocall sitesalways pass
strip_non_public_commits=False, whileGitSCM.clean_repocall sites always passTrue.Remove the kwarg and make each behaviour the default.