Update revert docs to align with current state #487
Closed
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -68,18 +68,24 @@ Our corpus/history (n) will be well over 250, so use that line. Our sample size | |
If you are making a large change, and are concerned about breaking job flavors that are not covered by your repository's usual presubmits, consider running [`/payload ...` testing](/docs/release-oversight/payload-testing) before merging your change. | ||
|
||
#### Who is watching for payload regressions and opening reverts? | ||
For now, the [Technical Release Team](/docs/release-oversight/the-technical-release-team) watches payload results on the current release branch. They will open the revert PRs and engage the involved teams (any team that contributed a PR to the payload that regressed). | ||
After your code merges, your work isn't complete until it's part of two accepted payloads. This ensures your code works and is upgradeable. If a payload fails after your change merges, check whether your code is the cause and, if so, revert that change. | ||
|
||
The [Technical Release Team](/docs/release-oversight/the-technical-release-team) acts as a safety net, identifying and reverting breaking changes that aren’t already found. | ||
|
||
#### How will we be contacted if our PR is part of a regressed payload? | ||
TRT will ping the relevant team slack aliases in #forum-release-oversight. | ||
TRT will ping the relevant team slack aliases in #forum-release-oversight. | ||
|
||
#### Who approves the revert for merge? | ||
In order to ensure they are included in the conversation/resolution, we would like the team that delivered the original PR to /approve the reversion. However, if they are unavailable or unresponsive this may be escalated to the staff-engineering team to ensure we do not remain in a regressed state longer than necessary. | ||
TRT has authority to apply the approved label to any revert in the OpenShift org, even for repositories where they are not in OWNERS. They will notify the original author on Slack and in the revert PR but, to avoid staying regressed any longer than necessary, will not wait for acknowledgement. This also means that if you're reverted, you don't need to respond until your normal working hours. | ||
|
||
If the repository is returning to a previously tested state (i.e. the reverted commit is the most recent), or the change is relatively isolated or simple, TRT may also `/override` certain CI jobs to expedite the revert. Generally, unit tests, lint, verify, image, and other fast-running jobs should not be overridden. | ||
|
||
The goal of fast reverts is to increase calmness, not chaos. Reverts get the org restored to working, and you have time and space to think about the best fix for the problem. | ||
|
||
#### What happens if a giant change like a rebase with high business value and on the critical path of the release regresses the pass rate? Will it be reverted? Will we work on it until it passes well enough? What if the kubelet build hours or days later causes the regression? | ||
Since we have a way to run the payload promotion checks before a PR merges, we encourage high-risk changes to run the payload acceptance checks before they merge. If that payload acceptance test fails, then it is 95% likely (math, not gut) that the PR is reducing our reliability. That signal is enough to engage other teams if necessary, but since most repos other than openshift/kubernetes are owned by a single team and that team is the local expert, in the majority of cases the expert should be local. | ||
|
||
If the PR isn’t pre-checked and we catch it during payload promotion, then even a high business value PR is subject to reversion. The unrevert PR is the place to talk about expected and observed impact to reliability versus the new feature. Both the revert and the unrevert are human processes, so there is room for discussion. | ||
If the PR isn’t pre-checked and we catch it during payload promotion, then even a high business value PR is subject to reversion. The unrevert PR is the place to talk about expected and observed impact to reliability versus the new feature. | ||
|
||
#### What does an “identified fix” look like? | ||
To avoid merging the revert, teams will need to be able to point to an open PR that contains a fix, and which has passed the payload acceptance checks. | ||
|
@@ -93,7 +99,6 @@ Given the timing of these events, it’s understandable that many teams aren’t | |
|
||
Remember, anytime our payloads are regressed, the entire org is being impacted. While there may be a small cost to a single team to land an un-revert, it avoids a greater cost to the org as a whole. We want to get back to green as quickly as possible and avoid a slippery slope of a team wanting to try “one more fix” before we revert. Teams can also optionally run the acceptance checks on their original PR before merging it, to reduce the risk that the PR will have to be reverted. | ||
|
||
|
||
#### What if these new checks are wrong? | ||
We will need to carefully study the outcomes of this process to ensure that we are getting value from it (finding actual regressions) and not causing unnecessary churn (raising revert PRs due to false positives when nothing has actually regressed, or the regression actually happened in an earlier payload but went undetected at the time). TRT will track data on how many times this reversion process gets triggered and the outcomes of each incident and then do a retrospective after the 4.10 release. | ||
"will ... after 4.10" -- presumably happened, we probably don't need to repeat? |
||
|
||
|
Maybe soften it a little by changing the order a bit:
TRT will attempt to notify the original author on Slack and in the revert PR. But to avoid staying regressed any longer than necessary, they have privileges to apply the approved label to any revert in the OpenShift org, even for repositories they are not in OWNERS. When the signal is clear regarding the source of a regression, TRT has a mandate to restore payloads to a healthy state as quickly as possible. This also means if you're reverted, you don't need to respond until your normal working hours.