
Commit df6736d

Add quarantine and de-quarantine process (#2746)
##### Short description:
Provide a clear guide on the expected flow when tests fail and require quarantining, and on how to de-quarantine tests.

##### More details:
##### What this PR does / why we need it:
##### Which issue(s) this PR fixes:
##### Special notes for reviewer:
##### jira-ticket:

## Summary by CodeRabbit

* **Documentation**
  * Removed placeholder TODO lines from quarantine guidance.
  * Added a comprehensive quarantine & de-quarantine guide: test-stability best practices, a three-step failure analysis and quarantine workflow, failure categorization and messaging, PR/gating requirements, recommended tooling and local verification, a 25-run re-inclusion check, and governance/checklist items.
1 parent d687efb commit df6736d

2 files changed

+219
-2
lines changed

docs/QUARANTINE.md

Lines changed: 0 additions & 2 deletions
This file was deleted.

docs/QUARANTINE_GUIDELINES.md

Lines changed: 219 additions & 0 deletions
@@ -0,0 +1,219 @@
1+
# OpenShift Virtualization Test Stability, Quarantine and De-quarantine Guidelines
2+
3+
## Introduction
4+
5+
This document establishes mandatory procedures for achieving and maintaining dependable **GREEN** test lanes for OpenShift Virtualization Tests (also known as tier2).
6+
7+
**Why This Matters:**
8+
- Unreliable test results hinder continuous delivery
9+
- Manual investigation wastes significant QE time
10+
- Reduced capacity for new feature development and automation
11+
- Mistrust in test results causes release delays
12+
13+
**Goal**: Enhance testing efficiency and reliability while maintaining full test coverage and high product quality.
14+
15+
## Current Challenges
16+
17+
- **Inconsistent test results**: Uncertainty and delays in release decisions
18+
- **Repeated lane failures**: Block continuous delivery and release schedules
19+
- **Manual investigation overhead**: Time spent determining if failures are bugs or test instability
20+
- **Manual verification burden**: Time on manual testing instead of new automation
21+
- **Troubleshooting waste**: Chasing intermittent issues consumes resources
22+
23+
## Milestone and Branch Management
24+
25+
### Y-Stream (`main` Branch) - EXTRA CAUTION REQUIRED
26+
27+
**For `main` branch (upcoming y-stream release), quarantining requires EXTRA CARE until Code Freeze (CF).**
28+
29+
- Product code may contain bugs until CF
30+
- Stabilization period expected between Feature Freeze and Blockers Only phase
31+
- Extra inspection is needed for ALL quarantine decisions
32+
- Handle test failures with care. Distinguish between product bugs and test issues
33+
34+
### Z-Stream - GREEN Expectation
35+
36+
Z-stream lanes **MUST BE GREEN**:
37+
- No new major features should be introduced
38+
- No API changes should be introduced
39+
- Any failures are likely regressions requiring immediate attention
40+
41+
## Workflow for GREEN Lanes
42+
43+
### Step 1: Review Process
44+
45+
**MANDATORY**: Review all failing tests across all lanes, and keep reviewing until every lane is GREEN.
46+
47+
**Review checklist:**
48+
- [ ] Identify all failing tests in all lanes
49+
- [ ] Group failures by test or pattern
50+
- [ ] Check for known issues in Jira
51+
- [ ] Analyze new failures immediately
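
The "group failures by test or pattern" item of the checklist above can be sketched as follows. This is only an illustration: the input shape (pairs of lane name and test ID) and the helper name `group_failures_by_test` are assumptions, not part of the project's tooling.

```python
from collections import defaultdict


def group_failures_by_test(failures):
    """Map each failing test ID to the sorted list of lanes where it failed.

    `failures` is a hypothetical list of (lane, test_id) pairs collected
    from the failed lane runs under review.
    """
    grouped = defaultdict(set)
    for lane, test_id in failures:
        grouped[test_id].add(lane)
    # Sort lanes so the result is stable and easy to read in a report.
    return {test_id: sorted(lanes) for test_id, lanes in grouped.items()}
```

A test that fails in several lanes is more likely a product or automation issue than a problem specific to one environment, so grouping first helps decide where to start the analysis.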
52+
53+
### Step 2: Test Failure Analysis (TFA)
54+
55+
For each failure, determine the root cause category and take appropriate action.
56+
Make sure to:
57+
- Be specific about what failed and why
58+
- Avoid generic messages like "Test failed" or "Assertion error"
59+
- If a product bug was opened, link to the bug in the analysis message
60+
61+
#### Category 1: Product Bug
62+
63+
**Identification**: Failure is due to an actual product defect.
64+
65+
**Actions:**
66+
- Open a bug in Jira with the appropriate priority (and a fix version if not opened on the next y-stream release).
67+
- Use the `pytest_jira` integration to skip the test conditionally while the bug is open. If needed, use `is_jira_open` for a conditional skip.
68+
69+
Conditional skip for product bug:
70+
71+
```python
import pytest

@pytest.mark.jira("CNV-12345", run=False)  # Skips if the bug is open
def test_feature_a():
    pass
```
78+
79+
#### Category 2: Automation Issue
80+
81+
**Identification**: Failure is due to test code or test framework issue.
82+
83+
**Actions:**
84+
- Open a task in Jira under the team's backlog
85+
- **Manual Verification MANDATORY**: Verify that the scenario passes manually before marking the analysis as complete.
  Manual verification confirms that the tested feature works as expected before the test is quarantined.
87+
- Include ALL failure analysis information and logs in Jira
88+
89+
Quarantine for automation issue:
90+
91+
Apply pytest's `xfail` marker with `run=False`, for example:
92+
93+
```python
import pytest
from utilities.constants import QUARANTINED


@pytest.mark.xfail(
    reason=f"{QUARANTINED}: VM is going into running state which it shouldn't, CNV-123",
    run=False,
)
def test_my_failing_test():
    ...
```
105+
106+
**Benefit**: The pytest summary reports quarantined tests separately:
107+
108+
```text
1 deselected, 2 passed, 1 quarantined, ...
```
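
How such a summary count could be derived is sketched below. This is not the repository's actual implementation: the value of `QUARANTINED` is an assumed stand-in for the constant in `utilities.constants`, and `summarize` with its list-of-tuples input is a hypothetical helper used only for illustration.

```python
from collections import Counter

# Assumption: stand-in for utilities.constants.QUARANTINED; the real value may differ.
QUARANTINED = "quarantined"


def summarize(results):
    """Count test outcomes, reporting quarantined xfails under their own key.

    `results` is a hypothetical list of (outcome, reason) tuples, for example
    ("xfailed", "quarantined: VM does not stop, CNV-123").
    """
    counts = Counter()
    for outcome, reason in results:
        # A substring check tolerates any prefix the runner adds to the reason.
        if outcome == "xfailed" and QUARANTINED in reason:
            counts["quarantined"] += 1
        else:
            counts[outcome] += 1
    return dict(counts)
```

Keeping the marker text in the `reason` string is what makes this kind of reporting possible, which is why the reason should always start with the `QUARANTINED` constant.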
111+
112+
**Jira Requirements:**
113+
- Title starts with `[stabilization]`
114+
- Labels: `quarantined-test`
115+
- Priority based on test importance
116+
- Complete failure analysis documentation
117+
- List any specific actions that should be taken when working on de-quarantining the test
118+
- If a backport is needed, state this in the ticket
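
The ticket requirements above lend themselves to a simple pre-submission check. The following is a minimal sketch, assuming the ticket is available as a plain dictionary; the key names (`title`, `labels`, `priority`, `analysis`) and the function name are hypothetical and do not reflect the Jira API.

```python
def quarantine_ticket_problems(ticket):
    """Return a list of requirement violations for a quarantine Jira ticket.

    `ticket` is a hypothetical dict with "title", "labels", "priority"
    and "analysis" keys; an empty result means all checks passed.
    """
    problems = []
    if not ticket.get("title", "").startswith("[stabilization]"):
        problems.append("title must start with [stabilization]")
    if "quarantined-test" not in ticket.get("labels", []):
        problems.append("missing quarantined-test label")
    if not ticket.get("priority"):
        problems.append("priority not set")
    if not ticket.get("analysis", "").strip():
        problems.append("failure analysis missing")
    return problems
```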
119+
120+
#### Category 3: System / Environment Issue
121+
122+
**Identification**: Failure is due to infrastructure, environment, or external dependencies.
123+
124+
**Actions:**
125+
- **CANNOT QUARANTINE** - Open a ticket to DevOps
126+
- **Manual Verification MANDATORY**: Verify that the scenario passes manually before marking analysis as complete
127+
- Perform additional analysis
128+
129+
**Additional Analysis Required:**
130+
1. Can the failure be avoided by running the test on specific hardware?
131+
2. Can the failure be caught during cluster sanity checks?
132+
3. Open an issue to update cluster sanity to include environment requirements that prevent this failure
133+
134+
**System Issue Examples:**
135+
- Artifactory connection problems
136+
- Network infrastructure issues
137+
- Resource constraints
138+
139+
### Step 3: Quarantine Decision
140+
141+
Whether a test requires quarantine is determined by the following criteria:
142+
- Failure is not a product bug
143+
- Repeated failures (twice or more in the past 10 runs)
144+
- Root cause requiring extended analysis
145+
- Tests should be migrated between tiers (e.g., tier2 → tier1)
146+
- Automation issue that cannot be fixed immediately
147+
148+
- **Gating test failures** - As gating tests have higher priority, they should be fixed as soon as possible
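
The "repeated failures" criterion (twice or more in the past 10 runs) can be expressed as a short check. This is a sketch under the assumption that a test's run history is available as a list of pass/fail booleans; the function name and input shape are hypothetical.

```python
def is_repeated_failure(history, window=10, threshold=2):
    """Return True if the test failed at least `threshold` times in its last `window` runs.

    `history` is a list of booleans ordered oldest to newest; True means the run passed.
    """
    recent = history[-window:]
    failures = sum(1 for passed in recent if not passed)
    return failures >= threshold
```

Older failures outside the window are ignored, so a test that was fixed and has been passing since does not keep qualifying for quarantine.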
149+
150+
### When NOT to Quarantine
151+
152+
**DO NOT quarantine**:
153+
- **System issues** - Based on your analysis, open a ticket to DevOps
154+
- **Automation blockers** - Tests that MUST be executed before a release (use blocker tickets instead)
155+
156+
### Blocker Tickets vs. Quarantine
157+
158+
Understanding this distinction is critical:
159+
160+
| Aspect             | Blocker Tickets                         | Quarantine                                     |
|--------------------|-----------------------------------------|------------------------------------------------|
| **Use Case**       | Tests MUST run before release           | Automation issues, can be temporarily disabled |
| **Urgency**        | Fix immediately (within current sprint) | Fix based on priority (may be future sprint)   |
| **Release Impact** | Blocks release until resolved           | Does not block release                         |
| **Tracking**       | Blocker bug in Jira                     | Task with `xfail` marker                       |
| **Fix Timeline**   | Current sprint                          | Based on priority                              |
167+
168+
### Implementation
169+
170+
**Tools:**
171+
- `pytest.mark.xfail` - For quarantine marking
172+
- `pytest_jira` - For conditional skipping when bugs are open
173+
- `pytest-repeat` - For stability verification (de-quarantine)
174+
175+
### Pull Request Requirements
176+
177+
When submitting a quarantine PR:
178+
179+
- **Title**: Must start with `Quarantine:`
180+
- **Description**: Include a link to the Jira ticket and a brief explanation
181+
- **Backporting**: If needed, make sure to backport the quarantine PR to other branches.
182+
183+
**GitHub Label**: The `quarantine` label is added automatically
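
The title and description requirements can be checked mechanically before a PR is opened. The sketch below is an approximation: it looks for a Jira issue key rather than a full link, and both the key pattern and the function name are assumptions.

```python
import re

# Assumption: Jira keys look like "CNV-123" (uppercase project key, dash, number).
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def quarantine_pr_problems(title, description):
    """Return a list of requirement violations for a quarantine PR."""
    problems = []
    if not title.startswith("Quarantine:"):
        problems.append("title must start with 'Quarantine:'")
    if not JIRA_KEY.search(description):
        problems.append("description must reference the Jira ticket")
    return problems
```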
184+
185+
186+
## De-Quarantine and Re-inclusion Process
187+
188+
- De-quarantine work must be included in each sprint based on test priority
189+
190+
### Re-inclusion Criteria
191+
192+
Tests can **ONLY** be re-included after demonstrating stability:
193+
- The test must be verified in Jenkins on a cluster as similar as possible to the one where the failure occurred
194+
- **25 consecutive successful runs via Jenkins** using `pytest-repeat`
195+
- The number can be adjusted based on test flakiness, characteristics and risk; this must be discussed within the SIG
196+
197+
**Local Verification Command:**
198+
```bash
pytest --repeat-scope=session --count=25 <path to test module>
```
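
The "25 consecutive successful runs" rule can be stated as a check over recorded results. This is a minimal sketch, assuming the run outcomes are available as a list of booleans; the function name is hypothetical and the default of 25 mirrors the guideline above.

```python
def has_consecutive_passes(results, required=25):
    """Return True if the most recent `required` runs all passed.

    `results` is a list of booleans ordered oldest to newest; True means the run passed.
    """
    if len(results) < required:
        return False
    return all(results[-required:])
```

A single failure inside the window resets the count, so the test must then accumulate a fresh streak of passing runs before it can be re-included.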
201+
202+
### De-Quarantine Checklist
203+
204+
Complete ALL items before re-including the test:
205+
206+
- [ ] Root cause identified and fixed
207+
- [ ] Test fix implemented and verified locally
208+
- [ ] **Assert messages enhanced** with meaningful information from original failure - if needed
209+
- [ ] `xfail` marker removed from test
210+
- [ ] Jenkins verification passes (by default: 25 consecutive runs)
211+
212+
Once the PR is merged:
213+
- [ ] The test is included in an active test lane
214+
- [ ] Jira ticket updated with:
  - [ ] Root cause explanation
  - [ ] Fix description
  - [ ] Verification results
218+
- [ ] Closed with appropriate resolution
219+
- [ ] If the fix needs to be backported to other branches, each backport PR must be verified using the same verification process
