| 1 | +# OpenShift Virtualization Test Stability, Quarantine and De-quarantine Guidelines |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +This document establishes mandatory procedures for achieving and maintaining dependable **GREEN** test lanes for OpenShift Virtualization Tests (also known as tier2). |
| 6 | + |
| 7 | +**Why This Matters:** |
| 8 | +- Unreliable test results hinder continuous delivery |
| 9 | +- Manual investigation wastes significant QE time |
| 10 | +- Reduced capacity for new feature development and automation |
| 11 | +- Mistrust in test results causes release delays |
| 12 | + |
| 13 | +**Goal**: Enhance testing efficiency and reliability while maintaining full test coverage and high product quality. |
| 14 | + |
| 15 | +## Current Challenges |
| 16 | + |
| 17 | +- **Inconsistent test results**: Uncertainty and delays in release decisions |
| 18 | +- **Repeated lane failures**: Block continuous delivery and release schedules |
| 19 | +- **Manual investigation overhead**: Time spent determining if failures are bugs or test instability |
| 20 | +- **Manual verification burden**: Time spent on manual testing instead of writing new automation |
| 21 | +- **Troubleshooting waste**: Chasing intermittent issues consumes resources |
| 22 | + |
| 23 | +## Milestone and Branch Management |
| 24 | + |
| 25 | +### Y-Stream (`main` Branch) - EXTRA CAUTION REQUIRED |
| 26 | + |
| 27 | +**For `main` branch (upcoming y-stream release), quarantining requires EXTRA CARE until Code Freeze (CF).** |
| 28 | + |
| 29 | +- Product code may contain bugs until CF |
| 30 | +- Stabilization period expected between Feature Freeze and Blockers Only phase |
| 31 | +- Extra inspection is needed for ALL quarantine decisions |
| 32 | +- Handle test failures with care. Distinguish between product bugs and test issues |
| 33 | + |
| 34 | +### Z-Stream - GREEN Expectation |
| 35 | + |
| 36 | +Z-stream lanes **MUST BE GREEN**: |
| 37 | +- No new major features should be introduced |
| 38 | +- No API changes should be introduced |
| 39 | +- Any failures are likely regressions requiring immediate attention |
| 40 | + |
| 41 | +## Workflow for GREEN Lanes |
| 42 | + |
| 43 | +### Step 1: Review Process |
| 44 | + |
| 45 | +**MANDATORY**: Conduct a review of all failing tests across different lanes until all are GREEN. |
| 46 | + |
| 47 | +**Review checklist:** |
| 48 | +- [ ] Identify all failing tests in all lanes |
| 49 | +- [ ] Group failures by test or pattern |
| 50 | +- [ ] Check for known issues in Jira |
| 51 | +- [ ] Analyze new failures immediately |
| 52 | + |
| 53 | +### Step 2: Test Failure Analysis (TFA) |
| 54 | + |
| 55 | +For each failure, determine the root cause category and take appropriate action. |
| 56 | +Make sure to: |
| 57 | +- Be specific about what failed and why |
| 58 | +- Avoid generic messages like "Test failed" or "Assertion error" |
| 59 | +- If a product bug was opened, link to the bug in the analysis message |
| 60 | + |
| 61 | +#### Category 1: Product Bug |
| 62 | + |
| 63 | +**Identification**: Failure is due to an actual product defect. |
| 64 | + |
| 65 | +**Actions:** |
| 66 | +- Open a bug in Jira with the appropriate priority (and a fix version, if the bug is not filed against the next y-stream release). |
| 67 | +- Use the `pytest_jira` integration to skip the test while the bug is open. If needed, use `is_jira_open` for a conditional skip. |
| 68 | + |
| 69 | +Conditional skip for product bug: |
| 70 | + |
| 71 | +```python |
| 72 | +import pytest |
| 73 | + |
| 74 | +@pytest.mark.jira("CNV-12345", run=False) # Skips if the bug is open |
| 75 | +def test_feature_a(): |
| 76 | +    pass |
| 77 | +``` |
| 78 | + |
| 79 | +#### Category 2: Automation Issue |
| 80 | + |
| 81 | +**Identification**: Failure is due to a test code or test framework issue. |
| 82 | + |
| 83 | +**Actions:** |
| 84 | +- Open a task in Jira under the team's backlog |
| 85 | +- **Manual Verification MANDATORY**: Before marking the analysis as complete, verify manually that the scenario passes. |
| 86 | +  This confirms that the tested feature works as expected before the test is quarantined. |
| 87 | +- Include ALL failure analysis information and logs in Jira |
| 88 | + |
| 89 | +Quarantine for automation issue: |
| 90 | + |
| 91 | +Apply pytest's `xfail` marker with `run=False`, for example: |
| 92 | + |
| 93 | +```python |
| 94 | +import pytest |
| 95 | +from utilities.constants import QUARANTINED |
| 96 | + |
| 97 | + |
| 98 | +@pytest.mark.xfail( |
| 99 | +    reason=f"{QUARANTINED}: VM is going into running state which it shouldn't, CNV-123", |
| 100 | +    run=False, |
| 101 | +) |
| 102 | +def test_my_failing_test(): |
| 103 | +    ... |
| 104 | +``` |
| 105 | + |
| 106 | +**Benefit**: Quarantined tests are counted separately in the pytest summary line: |
| 107 | + |
| 108 | +```text |
| 109 | +1 deselected, 2 passed, 1 quarantined, ... |
| 110 | +``` |
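A summary category such as `quarantined` can be produced with pytest's `pytest_report_teststatus` hook in `conftest.py`. The sketch below is an assumption about the mechanism, not a copy of the repository's code: the value of `QUARANTINED` is a placeholder, and the short letter `Q` and the verbose word are chosen for illustration only.

```python
# conftest.py (illustrative sketch)
QUARANTINED = "quarantined"  # placeholder; the real value lives in utilities.constants


def pytest_report_teststatus(report, config):
    """Report quarantined xfail tests under their own summary category."""
    # pytest stores the xfail reason on the report as `wasxfail`;
    # reports for tests without an xfail marker do not have the attribute.
    reason = getattr(report, "wasxfail", "")
    if report.skipped and QUARANTINED in reason:
        # (summary category, short progress letter, verbose word)
        return "quarantined", "Q", "QUARANTINED"
    # Returning None lets pytest fall back to its default status handling.
    return None
```

With `run=False`, pytest does not execute the test body, so the test appears in the report as skipped with the xfail reason attached, which is the case the hook matches.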
| 111 | + |
| 112 | +**Jira Requirements:** |
| 113 | +- Title starts with `[stabilization]` |
| 114 | +- Labels: `quarantined-test` |
| 115 | +- Priority based on test importance |
| 116 | +- Complete failure analysis documentation |
| 117 | +- List any specific actions that should be taken when working on de-quarantining the test |
| 118 | +- If a backport is needed, note this in the ticket |
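The mechanical parts of these requirements (title prefix, label, non-empty analysis) could be checked automatically. The following is a minimal sketch; `QuarantineTicket` and `ticket_problems` are hypothetical names used for illustration and do not correspond to any Jira client API.

```python
from dataclasses import dataclass, field

TITLE_PREFIX = "[stabilization]"
REQUIRED_LABEL = "quarantined-test"


@dataclass
class QuarantineTicket:
    """Hypothetical, simplified view of a Jira ticket (not a real Jira object)."""

    title: str
    labels: list[str] = field(default_factory=list)
    failure_analysis: str = ""


def ticket_problems(ticket: QuarantineTicket) -> list[str]:
    """Return the list of unmet requirements; an empty list means the ticket passes."""
    problems = []
    if not ticket.title.startswith(TITLE_PREFIX):
        problems.append(f"title must start with {TITLE_PREFIX}")
    if REQUIRED_LABEL not in ticket.labels:
        problems.append(f"missing label {REQUIRED_LABEL}")
    if not ticket.failure_analysis.strip():
        problems.append("failure analysis is empty")
    return problems
```

Priority, de-quarantine actions and backport notes still need a human reviewer; the sketch only covers what can be verified from the ticket fields.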
| 119 | + |
| 120 | +#### Category 3: System / Environment Issue |
| 121 | + |
| 122 | +**Identification**: Failure is due to infrastructure, environment, or external dependencies. |
| 123 | + |
| 124 | +**Actions:** |
| 125 | +- **CANNOT QUARANTINE** - Open a ticket to DevOps |
| 126 | +- **Manual Verification MANDATORY**: Verify that the scenario passes manually before marking analysis as complete |
| 127 | +- Perform additional analysis |
| 128 | + |
| 129 | +**Additional Analysis Required:** |
| 130 | +1. Can the failure be avoided by running the test on specific hardware? |
| 131 | +2. Can the failure be caught during cluster sanity checks? |
| 132 | +3. Open an issue to update the cluster sanity checks so that they cover the environment requirements that prevent this failure |
| 133 | + |
| 134 | +**System Issue Examples:** |
| 135 | +- Artifactory connection problems |
| 136 | +- Network infrastructure issues |
| 137 | +- Resource constraints |
| 138 | + |
| 139 | +### Step 3: Quarantine Decision |
| 140 | + |
| 141 | +Whether a test requires quarantine is determined by the following: |
| 142 | +- Failure is not a product bug |
| 143 | +- Repeated failures (twice or more in the past 10 runs) |
| 144 | +- Root cause requiring extended analysis |
| 145 | +- Tests should be migrated between tiers (e.g., tier2 → tier1) |
| 146 | +- Automation issue that cannot be fixed immediately |
| 147 | + |
| 148 | +- **Gating test failures** - As gating tests have higher priority, they should be fixed as soon as possible |
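The "twice or more in the past 10 runs" criterion is the only item above that is purely numeric, so it can be expressed as a small check. This is a sketch under the assumption that run history is available as a list of pass/fail booleans, oldest first; the function name and parameters are hypothetical.

```python
def is_repeated_failure(results, window=10, threshold=2):
    """Return True if the test failed at least `threshold` times in its last `window` runs.

    `results` is a list of booleans ordered oldest to newest; True means the run passed.
    """
    recent = results[-window:]
    failures = sum(1 for passed in recent if not passed)
    return failures >= threshold
```

For example, `is_repeated_failure([True] * 8 + [False, False])` is true, while two failures that happened more than 10 runs ago no longer count. The remaining criteria (not a product bug, root cause needing extended analysis, and so on) still require human judgment.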
| 149 | + |
| 150 | +### When NOT to Quarantine |
| 151 | + |
| 152 | +**DO NOT quarantine**: |
| 153 | +- **System issues** - Based on your analysis, open a ticket to DevOps |
| 154 | +- **Automation blockers** - Tests that MUST be executed before a release (use blocker tickets instead) |
| 155 | + |
| 156 | +### Blocker Tickets vs. Quarantine |
| 157 | + |
| 158 | +Understanding this distinction is critical: |
| 159 | + |
| 160 | +| Aspect | Blocker Tickets | Quarantine | |
| 161 | +|--------------------|-----------------------------------------|------------------------------------------------| |
| 162 | +| **Use Case** | Tests MUST run before release | Automation issues, can be temporarily disabled | |
| 163 | +| **Urgency** | Fix immediately (within current sprint) | Fix based on priority (may be future sprint) | |
| 164 | +| **Release Impact** | Blocks release until resolved | Does not block release | |
| 165 | +| **Tracking** | Blocker bug in Jira | Task with `xfail` marker | |
| 166 | +| **Fix Timeline** | Current sprint | Based on priority | |
| 167 | + |
| 168 | +### Implementation |
| 169 | + |
| 170 | +**Tools:** |
| 171 | +- `pytest.mark.xfail` - For quarantine marking |
| 172 | +- `pytest_jira` - For conditional skipping when bugs are open |
| 173 | +- `pytest-repeat` - For stability verification (de-quarantine) |
| 174 | + |
| 175 | +### Pull Request Requirements |
| 176 | + |
| 177 | +When submitting a quarantine PR: |
| 178 | + |
| 179 | +- **Title**: Must start with `Quarantine:` |
| 180 | +- **Description**: Include a link to the Jira ticket and a brief explanation |
| 181 | +- **Backporting**: If needed, make sure to backport the quarantine PR to other branches. |
| 182 | + |
| 183 | +**GitHub Label**: The `quarantine` label is added automatically |
| 184 | + |
| 185 | + |
| 186 | +## De-Quarantine and Re-inclusion Process |
| 187 | + |
| 188 | +- De-quarantine work must be included in each sprint, based on test priority |
| 189 | + |
| 190 | +### Re-inclusion Criteria |
| 191 | + |
| 192 | +Tests can **ONLY** be re-included after demonstrating stability: |
| 193 | +- The test must be verified in Jenkins on a cluster as similar as possible to the one where the failure occurred |
| 194 | +- **25 consecutive successful runs via Jenkins** using `pytest-repeat` |
| 195 | +- The number of runs can be adjusted based on the test's flakiness, characteristics and risk; any adjustment must be discussed within the SIG |
| 196 | + |
| 197 | +**Local Verification Command:** |
| 198 | +```bash |
| 199 | +pytest --repeat-scope=session --count=25 <path to test module> |
| 200 | +``` |
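The "consecutive successful runs" criterion means that a single failure resets the count. A minimal sketch of that rule follows, assuming the run history is a list of booleans ordered oldest to newest; the function names are hypothetical and the default of 25 mirrors the criterion above.

```python
def consecutive_passes(results):
    """Count the unbroken streak of passing runs at the end of the history."""
    streak = 0
    for passed in reversed(results):
        if not passed:
            break  # a failure ends the streak; older passes do not count
        streak += 1
    return streak


def is_stable(results, required=25):
    """Return True once the most recent `required` runs have all passed."""
    return consecutive_passes(results) >= required
```

A history of 24 passes, one failure, and 24 more passes is therefore not yet stable: only the last 24 runs count toward the streak.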
| 201 | + |
| 202 | +### De-Quarantine Checklist |
| 203 | + |
| 204 | +Complete ALL items before re-including the test: |
| 205 | + |
| 206 | +- [ ] Root cause identified and fixed |
| 207 | +- [ ] Test fix implemented and verified locally |
| 208 | +- [ ] **Assert messages enhanced** with meaningful information from the original failure (if needed) |
| 209 | +- [ ] `xfail` marker removed from test |
| 210 | +- [ ] Jenkins verification passes (by default: 25 consecutive runs) |
| 211 | + |
| 212 | +Once the PR is merged: |
| 213 | +- [ ] The test is included in an active test lane |
| 214 | +- [ ] Jira ticket updated with: |
| 215 | +  - [ ] Root cause explanation |
| 216 | +  - [ ] Fix description |
| 217 | +  - [ ] Verification results |
| 218 | +  - [ ] Closed with appropriate resolution |
| 219 | +- [ ] If the fix needs to be backported to other branches, each backport PR must be verified using the same verification process |