Conversation
📊 Monthly Leaderboard

Hi @Rudra-rps! Here's how you rank for March 2026:

Scoring this month (across the OWASP-BLT org): Open PRs (+1 each), Merged PRs (+10), Closed (not merged) (−2), Reviews (+5; first two per PR in-month), Comments (+2, excludes CodeRabbit).
👋 Hi @Rudra-rps! This pull request needs a peer review before it can be merged. Please request a review from a team member who is not:
Once a valid peer review is submitted, this check will pass automatically. Thank you!
🍃 PR Readiness Check

Check the readiness of this PR on Leaf: Leaf reviews pull requests for operational readiness, security risks, and production-impacting changes before they ship.
Walkthrough

Introduces a complete flakiness detection pipeline consisting of a GitHub Actions workflow that orchestrates CI result collection, failure analysis, job retries, and flakiness reporting. Includes database schema for storing run history and computed flakiness scores, Python scripts implementing the pipeline stages, comprehensive tests, and integration with the existing PR readiness system to factor flaky failures into CI confidence scoring.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches

🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
@coderabbitai are all changes in scope or required?
Tip: For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed (output length: 662)
🏁 Script executed (output length: 36)
🏁 Script executed (output length: 12151)
No file appears to introduce unrelated concerns or unnecessary scope creep.
@coderabbitai full review
✅ Actions performed

Full review triggered.
Actionable comments posted: 5
🧹 Nitpick comments (5)
scripts/flakiness/tests/test_report_flakiness.py (1)
77-77: Minor typo in test method name.

The method name `test_contains_nextg_steps_section` appears to have a typo; it should likely be `test_contains_next_steps_section`.

💅 Suggested fix

```diff
- def test_contains_nextg_steps_section(self):
+ def test_contains_next_steps_section(self):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/tests/test_report_flakiness.py` at line 77, Rename the test function named test_contains_nextg_steps_section to test_contains_next_steps_section in the test_report_flakiness.py file; update the function declaration and any internal or external references (e.g., other tests, fixtures, or test-ids) that refer to test_contains_nextg_steps_section so test discovery and references remain consistent, ensuring no other code relies on the old name.

scripts/flakiness/collect_ci_results.py (1)
146-154: Simplify the single-element tuple check.

Line 148 uses `if conclusion_category in ('pass',):`, which is valid but unnecessarily verbose for a single value. Consider simplifying to a direct equality check.

♻️ Suggested simplification

```diff
- if conclusion_category in ('pass',):
+ if conclusion_category == 'pass':
      status = 'pass'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/collect_ci_results.py` around lines 146 - 154, The check using a single-element tuple is overly verbose: in the loop that calls classify_conclusion (the block iterating "for job in jobs"), replace the membership test "if conclusion_category in ('pass',):" with a direct equality comparison "if conclusion_category == 'pass'" to simplify the logic and improve readability; keep the subsequent elif (conclusion_category == 'skip') and else branches unchanged.

src/utils.py (1)
439-450: Clarify the reduced penalty logic for edge cases.

The condition `checks_failed <= known_flaky_count` applies the reduced penalty even when some failures might be from non-flaky checks. For example, if there are 2 known flaky checks in the repo and 2 current failures, the reduced penalty applies even if both failures are from completely different checks.

This is acknowledged in the comment at Lines 479-481 ("we cannot match by check name here because pr_data only carries aggregate counts"), making this a conservative upper-bound approach. Consider adding a brief inline comment here explaining this design trade-off for future maintainers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils.py` around lines 439 - 450, Add an inline comment near the reduced-penalty logic (the block using known_flaky_count, checks_failed, and fail_penalty) explaining the design trade-off: because pr_data only provides aggregate counts (not check names), the condition checks_failed <= known_flaky_count is a conservative upper-bound that may wrongly apply the reduced 0.20 penalty when failing checks are different from the known flaky ones; explicitly document this limitation and that this choice intentionally favors weaker penalization in ambiguous cases.

scripts/flakiness/tests/test_retry_failures.py (1)
289-318: Consider asserting on `mock_mark` for test completeness.

The static analysis correctly identifies that `mock_mark` is passed but never used. Since `job-a` is classified as `confirmed_flake`, `mark_flake_confirmed` should be called once. Adding an assertion would strengthen this test.

💚 Suggested assertion addition
```diff
     self.assertEqual(output['job-a'], 'confirmed_flake')
     self.assertEqual(output['job-b'], 'real_failure')
+    # Verify mark_flake_confirmed was called only for the flaky job
+    mock_mark.assert_called_once()
+    self.assertEqual(mock_mark.call_args[0][4], 'job-a')  # job_name argument
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/tests/test_retry_failures.py` around lines 289 - 318, The test test_multiple_jobs_classified_independently currently patches mark_flake_confirmed as mock_mark but never asserts it was called; add an assertion after the existing output checks to verify mark_flake_confirmed was invoked once for the confirmed flake (e.g., call mock_mark.assert_called_once() or mock_mark.assert_called_once_with(...) if you know the expected args) so the test verifies that retry_failures.mark_flake_confirmed was actually called when job-a is classified as 'confirmed_flake'.

scripts/flakiness/report_flakiness.py (1)
331-344: Use the parsed report as the artifact source of truth.

`report` is already loaded from `--flaky-report`/stdin above, but this branch rebuilds `all_scores` from D1. That can make `flakiness_report.md` and `flakiness_metrics.json` diverge from the exact analysis results the script just processed. If you still need DB-only fields like `last_updated`, merge them in after starting from `report`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/report_flakiness.py` around lines 331 - 344, The dry-run branch replaces the parsed input `report` with fresh DB rows by calling `d1_select`, which causes the generated artifacts to diverge from the exact analysis the script just processed; instead, keep `all_scores` sourced from the already-loaded `report` (use `report.get('flaky', []) + report.get('deterministic', []) + report.get('stable', [])`) and, if you still need DB-only fields such as `last_updated`, merge those fields into the items from `report` by looking them up via `d1_select` results (e.g., keyed by test id) and copying only the missing DB fields into the `report` entries before writing `flakiness_report.md` and `flakiness_metrics.json`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@migrations/0004_create_flakiness_tables.sql`:
- Around line 40-57: The migration seed for the known_infrastructure_issues
table is missing two patterns that are present in flakiness_config.yml; update
the INSERT INTO known_infrastructure_issues block in
migrations/0004_create_flakiness_tables.sql to include entries for the
"dependency" and "upstream" patterns (with category 'infrastructure' and
appropriate brief descriptions) so the database seeding matches the config;
ensure the new rows use INSERT OR IGNORE semantics and follow the same schema
(pattern, category, description) as the existing entries.
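The idempotent seeding semantics described above can be sketched against an in-memory SQLite stand-in for D1. The `(pattern, category, description)` shape follows the prompt; the exact schema and the example descriptions are assumptions:

```python
import sqlite3

# In-memory stand-in for the D1 table; schema is an assumption based
# on the (pattern, category, description) shape named in the prompt.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE known_infrastructure_issues ('
    '  pattern TEXT PRIMARY KEY,'
    '  category TEXT NOT NULL,'
    '  description TEXT)'
)

SEED = [
    ('dependency', 'infrastructure', 'Dependency service outage'),
    ('upstream', 'infrastructure', 'Upstream CI failure'),
]

def seed_patterns(conn, rows):
    # INSERT OR IGNORE keeps reruns of the migration idempotent.
    conn.executemany(
        'INSERT OR IGNORE INTO known_infrastructure_issues '
        '(pattern, category, description) VALUES (?, ?, ?)',
        rows,
    )

seed_patterns(conn, SEED)
seed_patterns(conn, SEED)  # second run inserts nothing
row_count = conn.execute(
    'SELECT COUNT(*) FROM known_infrastructure_issues').fetchone()[0]
```

Because the conflict falls on the `pattern` primary key, re-running the seed leaves the table unchanged.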
In `@scripts/flakiness/analyze_flakiness.py`:
- Around line 200-250: Selection, history lookup and UPSERT must use the same
identity: include workflow_name everywhere; update the history d1_select call to
filter by workflow_name (add AND workflow_name = ? and pass workflow_name in the
params used in the call) and change the UPSERT conflict target in d1_query from
ON CONFLICT(check_name, job_name) to ON CONFLICT(check_name, job_name,
workflow_name) so rows are per (check_name, job_name, workflow_name); touch the
related variables check_name, job_name, workflow_name and the functions
d1_select, analyze_check, d1_query in the loop to ensure the same tuple is used
for selection, analysis and the INSERT/UPDATE.
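The widened conflict target can be illustrated with SQLite. Column names follow the prompt; the simplified table shape and example values are assumptions:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE flakiness_scores ('
    '  check_name TEXT, job_name TEXT, workflow_name TEXT,'
    '  flakiness_score REAL,'
    '  PRIMARY KEY (check_name, job_name, workflow_name))'
)

def upsert_score(conn, check_name, job_name, workflow_name, score):
    # Conflict target includes workflow_name, so rows are unique per
    # (check_name, job_name, workflow_name) trio.
    conn.execute(
        'INSERT INTO flakiness_scores '
        '(check_name, job_name, workflow_name, flakiness_score) '
        'VALUES (?, ?, ?, ?) '
        'ON CONFLICT(check_name, job_name, workflow_name) '
        'DO UPDATE SET flakiness_score = excluded.flakiness_score',
        (check_name, job_name, workflow_name, score),
    )

upsert_score(conn, 'lint', 'build', 'ci.yml', 0.2)
upsert_score(conn, 'lint', 'build', 'nightly.yml', 0.9)  # distinct row
upsert_score(conn, 'lint', 'build', 'ci.yml', 0.4)       # updates first row
rows = conn.execute(
    'SELECT workflow_name, flakiness_score FROM flakiness_scores '
    'ORDER BY workflow_name').fetchall()
```

With the two-column conflict target of the original code, the second call would have overwritten the `ci.yml` row instead of creating a `nightly.yml` one.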
- Around line 71-102: The classification currently ignores flake_confirmed
counts; change the classification logic so that the 'flaky' branch requires both
failure_rate >= flaky_min AND flaky_count > 0 (i.e., at least one
'flake_confirmed' in window_rows). Concretely, update the conditional sequence
around variables consecutive, failure_rate, flaky_count, consec_det, flaky_max,
flaky_min so that deterministic branches remain the same but the flaky branch
becomes "elif failure_rate >= flaky_min and flaky_count > 0: classification =
'flaky'"; keep flakiness_score and severity computation using _get_severity only
for classification == 'flaky'.
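A sketch of the adjusted branch order. Threshold names come from the prompt; the config values are illustrative, and the fallback to 'stable' for a failing-but-unconfirmed check is this sketch's own assumption:

```python
def classify(failure_rate, flaky_count, consecutive, config):
    """Deterministic branches stay first; the flaky branch now also
    requires at least one 'flake_confirmed' row in the window."""
    consec_det = config['consecutive_deterministic']
    flaky_min = config['flaky_min']
    flaky_max = config['flaky_max']
    if consecutive >= consec_det or failure_rate > flaky_max:
        return 'deterministic'
    elif failure_rate >= flaky_min and flaky_count > 0:
        return 'flaky'
    return 'stable'

# Illustrative thresholds only, not the project's real config.
CONFIG = {'consecutive_deterministic': 3, 'flaky_min': 0.1, 'flaky_max': 0.8}
```

The key behavioral change is the second branch: a moderate failure rate alone no longer earns the 'flaky' label without a confirmed flake.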
In `@scripts/flakiness/report_flakiness.py`:
- Around line 55-83: The issue is that titles and lookups collapse identity to
check_name only; update _issue_title (and any callers like search_flaky_issue
and create_issue) to include job_name and workflow_name along with check_name
and prefix (e.g., "{prefix} {workflow_name} / {job_name} / {check_name}") so the
generated title is unique per job+workflow+check, and ensure the PR
summary/table generation uses the same composed identity (replace references to
only check_name with the trio job_name, workflow_name, check_name) so
search_flaky_issue finds/compares using the exact same title format.
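A minimal sketch of the composed title; the prefix value and separator styling are assumptions, only the trio-of-identifiers idea comes from the prompt:

```python
def _issue_title(prefix, workflow_name, job_name, check_name):
    # All three identifiers make the title unique per
    # workflow/job/check, so search_flaky_issue can match the exact
    # same string it generated.
    return f'{prefix} {workflow_name} / {job_name} / {check_name}'

title = _issue_title('[Flaky test]', 'CI', 'unit-tests', 'pytest')
```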
In `@scripts/flakiness/retry_failures.py`:
- Line 164: The print statement in retry_failures.py currently uses an
unnecessary f-string: change the line printing "[retry] Simulated retry result:
success" to use a normal string (remove the leading `f`) so it becomes a regular
print call; locate the print in the retry/logging area (the print statement
shown as `print(f'[retry] Simulated retry result: success', file=sys.stderr)`)
and remove the `f` prefix.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: OWASP-BLT/coderabbit/.coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: f37f3649-2296-49b3-8fde-998b8758ebe4
📒 Files selected for processing (19)
- .github/workflows/flakiness_detector.yml
- migrations/0004_create_flakiness_tables.sql
- package.json
- scripts/flakiness/analyze_flakiness.py
- scripts/flakiness/collect_ci_results.py
- scripts/flakiness/db_utils.py
- scripts/flakiness/flakiness_config.yml
- scripts/flakiness/report_flakiness.py
- scripts/flakiness/retry_failures.py
- scripts/flakiness/tests/__init__.py
- scripts/flakiness/tests/test_analyze_flakiness.py
- scripts/flakiness/tests/test_collect_ci_results.py
- scripts/flakiness/tests/test_db_utils.py
- scripts/flakiness/tests/test_report_flakiness.py
- scripts/flakiness/tests/test_retry_failures.py
- src/cache.py
- src/database.py
- src/handlers.py
- src/utils.py
Actionable comments posted: 6
🧹 Nitpick comments (5)
scripts/flakiness/tests/test_analyze_flakiness.py (1)
267-271: Strengthen this upsert assertion against schema drift.

This only proves that an upsert exists. It will still pass if `workflow_name` or any of the metric columns disappear from the generated SQL. Assert the full column set expected by `migrations/0004_create_flakiness_tables.sql`.

Suggested test hardening

```diff
     sql = mock_query.call_args[0][3]
     self.assertIn('INSERT INTO flakiness_scores', sql)
     self.assertIn('ON CONFLICT', sql)
+    for column in (
+        'check_name',
+        'job_name',
+        'workflow_name',
+        'flakiness_score',
+        'severity',
+        'classification',
+        'total_runs',
+        'failure_count',
+        'flaky_failures',
+        'consecutive_failures',
+        'last_updated',
+    ):
+        self.assertIn(column, sql)
```

scripts/flakiness/tests/test_report_flakiness.py (1)
77-79: Typo in test method name: `test_contains_nextg_steps_section`

The method name contains a typo; "nextg" should be "next". While this doesn't affect test execution, it impacts discoverability and readability.

✏️ Proposed fix

```diff
- def test_contains_nextg_steps_section(self):
+ def test_contains_next_steps_section(self):
      body = _build_issue_body(_entry())
      self.assertIn('Next steps', body)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/tests/test_report_flakiness.py` around lines 77 - 79, Rename the test method test_contains_nextg_steps_section to test_contains_next_steps_section to fix the typo; update the test function definition where it calls _build_issue_body(_entry()) and asserts 'Next steps' is in body so the test name accurately reflects its intent and improves readability/discoverability.

scripts/flakiness/db_utils.py (1)
83-89: Consider defensive handling for missing 'pattern' key.

If a `known_infrastructure_issues` row somehow lacks a `pattern` column, line 89 would raise `KeyError`. While unlikely given the schema, a defensive `.get()` with a filter could improve robustness.

🛡️ Defensive approach

```diff
  def get_infra_patterns(account_id, db_id, token):
      """Return a list of lowercase infrastructure pattern strings from D1."""
      rows = d1_select(
          account_id,
          db_id,
          token,
          'SELECT pattern FROM known_infrastructure_issues',
      )
-     return [row['pattern'].lower() for row in rows]
+     return [row['pattern'].lower() for row in rows if row.get('pattern')]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/db_utils.py` around lines 83 - 89, get_infra_patterns currently assumes every row has a 'pattern' key and will KeyError if not; update the return to defensively extract the value (use row.get('pattern')), filter out None/non-string values, and only call .lower() on legitimate strings so missing or malformed rows are skipped (modify the list comprehension in get_infra_patterns accordingly).

scripts/flakiness/tests/test_retry_failures.py (1)
289-317: Unused `mock_mark` should verify flake confirmation for the multi-job scenario.

The test patches `mark_flake_confirmed` as `mock_mark` but doesn't assert on it. Since `job-a` is classified as `confirmed_flake`, the test should verify `mark_flake_confirmed` was called for `job-a` but not for `job-b`.

✅ Proposed fix to add assertion

```diff
     self.assertEqual(output['job-a'], 'confirmed_flake')
     self.assertEqual(output['job-b'], 'real_failure')
+    # Verify mark_flake_confirmed was called only for the confirmed flake
+    mock_mark.assert_called_once()
+    call_args = mock_mark.call_args[0]
+    self.assertEqual(call_args[4], 'job-a')  # job_name argument
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/tests/test_retry_failures.py` around lines 289 - 317, The test_multiple_jobs_classified_independently test patches mark_flake_confirmed as mock_mark but never asserts it was called; update the test to assert mock_mark was called once for 'job-a' and not called for 'job-b' after retry_failures.main() runs by using mock_mark.assert_any_call(...) or equivalent and mock_mark.assert_not_called()/assert_called_once_with for the appropriate arguments matching how mark_flake_confirmed is invoked in the code; ensure you reference the patched mock_mark and the job identifiers 'job-a' and 'job-b' when adding the assertions.

scripts/flakiness/collect_ci_results.py (1)
98-99: Consider logging unknown conclusions rather than silently treating them as pass.

Unknown conclusions (e.g., `'action_required'`, `'stale'`) are treated as `'pass'`, which may hide unexpected job states. Consider logging these for observability.

🔍 Proposed enhancement

```diff
  # unknown conclusions (e.g. 'action_required') → treat as pass
+ if conclusion not in ('', 'success', 'failure', 'skipped', 'cancelled', 'neutral', 'timed_out'):
+     import sys
+     print(f'[warn] Unknown conclusion "{conclusion}" for job "{job.get("name")}" — treating as pass',
+           file=sys.stderr)
  return 'pass'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/collect_ci_results.py` around lines 98 - 99, The code currently treats any unknown conclusion by silently returning 'pass' (the line with return 'pass'); update this to log the unexpected conclusion value before returning so it's visible in CI logs — e.g., use the module's logger (logger.warning or logging.warn) or print a warning that includes the conclusion variable (e.g., f"Unexpected conclusion: {conclusion} -> treating as 'pass'") and then return 'pass'; place the log immediately before the existing return so behavior is unchanged but observable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/flakiness/analyze_flakiness.py`:
- Line 172: The print call uses an unnecessary f-string prefix in the message
printed to sys.stderr (the line containing print(f'[dry-run] → insufficient
data (need ≥5 runs)', file=sys.stderr)); remove the leading "f" so the literal
string is printed (change to print('[dry-run] → insufficient data (need ≥5
runs)', file=sys.stderr)) to avoid misleading use of f-strings.
In `@scripts/flakiness/flakiness_config.yml`:
- Around line 22-23: The YAML contains overly broad infrastructure patterns
"dependency" and "upstream" that will match many non-infra failures; replace
them with narrower, explicit patterns (e.g., more specific labels or prefixed
tokens like "infra:dependency" / "infra:upstream" or anchored regexes such as
^dependency-service- or ^upstream-ci- ) so only true infra-related test failures
are classified as noise; update the entries referencing "dependency" and
"upstream" in flakiness_config.yml (and any consumers that emit those tags) to
the chosen, more specific tokens.
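The anchored-pattern idea can be sketched with `re`. The tokens below are the examples suggested in the prompt above, not the project's final choices:

```python
import re

# Anchored prefixes, per the suggestion above; hypothetical tokens.
INFRA_PATTERNS = [re.compile(p) for p in (
    r'^dependency-service-',
    r'^upstream-ci-',
)]

def is_infra_failure(check_name):
    # Anchoring at the start avoids matching unrelated names that
    # merely contain the word "dependency" or "upstream".
    return any(p.search(check_name) for p in INFRA_PATTERNS)
```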
In `@scripts/flakiness/report_flakiness.py`:
- Line 261: The unpacking owner, repo_name = args.repo.split('/', 1) in
report_flakiness.py will crash if args.repo lacks a '/'—add validation before
unpacking: check that args.repo contains a single '/' (or that split('/',1)
returns two parts), and if not call parser.error or raise a clear ValueError
with a message like "Invalid --repo format, expected owner/repo"; update the
parsing flow where args.repo is consumed (same spot as owner, repo_name) to
perform this check and fail fast with a helpful message.
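A fail-fast validation sketch for the `owner/repo` unpacking described above; the function name and error message are illustrative:

```python
def parse_repo(repo):
    # Validate before unpacking so a malformed value produces a clear
    # message instead of a cryptic unpacking ValueError.
    parts = repo.split('/', 1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(
            f'Invalid --repo format {repo!r}, expected owner/repo')
    return parts[0], parts[1]
```

In an argparse flow, the same check could call `parser.error(...)` instead of raising, so the script exits with a usage message and a non-zero status.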
In `@scripts/flakiness/retry_failures.py`:
- Around line 182-183: Validate and handle a malformed args.repo before
unpacking: check that args.repo contains a '/' (e.g., using 'in' or split
length) and, if not, print a clear error or raise a parser error/exit with
non-zero status so the script doesn't crash on unpacking; update the code around
the owner, repo_name = args.repo.split('/', 1) assignment to perform this
validation or wrap it in try/except and emit a helpful message (reference:
args.repo, owner, repo_name, and get_d1_credentials).
In `@src/cache.py`:
- Around line 337-347: The code currently sets _flakiness_cache['data'] = {} on
any exception from get_db/get_all_flakiness_scores which caches a failed D1
read; change it so the cache is only updated on a successful load: move the
assignments to _flakiness_cache['data'] and _flakiness_cache['timestamp'] inside
the try block after scores is retrieved, and in the except block do not write an
empty dict to the cache (just log the error). If you must handle the bootstrap
"table missing" case specially, detect that specific error from
get_all_flakiness_scores and decide whether to cache an explicit sentinel, but
do not globally memoize empty {} on any D1 failure; use the existing symbols
get_db, get_all_flakiness_scores, _flakiness_cache, env, and current_time to
locate and apply the change.
In `@src/database.py`:
- Around line 397-429: The flakiness helpers currently key and query only by
(check_name, job_name) causing collisions; update get_all_flakiness_scores and
get_flakiness_score to include repo (e.g., repository or repo_name) and
workflow_name in the SELECT, the dict key, and the WHERE clause (so keys become
(repo, workflow_name, check_name, job_name) and the prepared statement in
get_flakiness_score binds repo and workflow_name before check_name/job_name);
also update any callers consuming the dict to expect the extended key and modify
the DB schema/index (flakiness_scores table and its indexes) to store and index
repo and workflow_name for correct scoped lookups.
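The extended key can be sketched as a plain dict index over rows; the row shape is an assumption standing in for the D1 result set:

```python
def build_score_index(rows):
    # Key by the full identity so two repos (or workflows) with the
    # same check/job names never collide.
    return {
        (r['repo'], r['workflow_name'], r['check_name'], r['job_name']): r
        for r in rows
    }

rows = [
    {'repo': 'a/b', 'workflow_name': 'ci', 'check_name': 'lint',
     'job_name': 'build', 'flakiness_score': 0.7},
    {'repo': 'c/d', 'workflow_name': 'ci', 'check_name': 'lint',
     'job_name': 'build', 'flakiness_score': 0.1},
]
index = build_score_index(rows)
```

Keyed only by `(check_name, job_name)`, these two rows would collapse into one entry; the four-part key keeps them distinct.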
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: OWASP-BLT/coderabbit/.coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 16b17373-fb92-4ce5-baa7-28db482584ad
📒 Files selected for processing (20)
- .github/workflows/flakiness_detector.yml
- .gitignore
- migrations/0004_create_flakiness_tables.sql
- package.json
- scripts/flakiness/analyze_flakiness.py
- scripts/flakiness/collect_ci_results.py
- scripts/flakiness/db_utils.py
- scripts/flakiness/flakiness_config.yml
- scripts/flakiness/report_flakiness.py
- scripts/flakiness/retry_failures.py
- scripts/flakiness/tests/__init__.py
- scripts/flakiness/tests/test_analyze_flakiness.py
- scripts/flakiness/tests/test_collect_ci_results.py
- scripts/flakiness/tests/test_db_utils.py
- scripts/flakiness/tests/test_report_flakiness.py
- scripts/flakiness/tests/test_retry_failures.py
- src/cache.py
- src/database.py
- src/handlers.py
- src/utils.py
```python
        file=sys.stderr)
result = analyze_check(history, config)
if result is None:
    print(f'[dry-run] → insufficient data (need ≥5 runs)', file=sys.stderr)
```
Remove unnecessary f-string prefix.
Line 172 has an f-string without any placeholders.
🔧 Fix
```diff
- print(f'[dry-run] → insufficient data (need ≥5 runs)', file=sys.stderr)
+ print('[dry-run] → insufficient data (need ≥5 runs)', file=sys.stderr)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
print('[dry-run] → insufficient data (need ≥5 runs)', file=sys.stderr)
```
🧰 Tools
🪛 Ruff (0.15.5)
[error] 172-172: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/flakiness/analyze_flakiness.py` at line 172, The print call uses an
unnecessary f-string prefix in the message printed to sys.stderr (the line
containing print(f'[dry-run] → insufficient data (need ≥5 runs)',
file=sys.stderr)); remove the leading "f" so the literal string is printed
(change to print('[dry-run] → insufficient data (need ≥5 runs)',
file=sys.stderr)) to avoid misleading use of f-strings.
@Rudra-rps Solid concept and great test coverage. A few things to check:
infrastructure pattern scope, and validation issues.
Actionable comments posted: 1
♻️ Duplicate comments (1)
scripts/flakiness/analyze_flakiness.py (1)
172-172: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefix.
Line 172 has an f-string without any placeholders. This was flagged in a previous review.
🔧 Proposed fix
- print(f'[dry-run] → insufficient data (need ≥5 runs)', file=sys.stderr) + print('[dry-run] → insufficient data (need ≥5 runs)', file=sys.stderr)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/analyze_flakiness.py` at line 172, The print call using an f-string at print(f'[dry-run] → insufficient data (need ≥5 runs)', file=sys.stderr) has no placeholders—remove the unnecessary f prefix and make it a normal string literal in the same print(...) call; update the expression in the analyze_flakiness.py print statement accordingly so it reads print('[dry-run] → insufficient data (need ≥5 runs)', file=sys.stderr) without changing behavior.
🧹 Nitpick comments (5)
scripts/flakiness/report_flakiness.py (1)
278-279: Verify: credentials variables may be unbound on certain code paths.

The variables `account_id`, `db_id`, `token` are only assigned when `not args.dry_run` (Lines 278-279), but they're used in the `d1_select` call on Lines 363-367, which is also guarded by `else` (not dry_run). The logic appears correct, but the variable scope could be clearer.

Consider initializing to `None` before the conditional or restructuring to make the scope explicit:

💡 Optional clarity improvement

```diff
+ account_id = db_id = token = None
  if not args.dry_run:
      account_id, db_id, token = get_d1_credentials()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/flakiness/report_flakiness.py` around lines 278 - 279, The variables account_id, db_id, and token are only set inside the if not args.dry_run branch which can make their scope unclear; to fix, explicitly initialize account_id, db_id, token = None, None, None before the if or move the d1-related logic (including the d1_select call) into the same not args.dry_run block so get_d1_credentials() and d1_select are in the same scope; update references to args.dry_run, get_d1_credentials, and d1_select accordingly to ensure the variables are always defined when used.src/database.py (1)
414-415: Minor: Use f-string conversion flag for cleaner exception formatting.

Per static analysis, prefer `{e!s}` over `{str(e)}` for slightly cleaner syntax.

🔧 Proposed fix

```diff
  except Exception as e:
-     print(f"Flakiness: Error loading all scores: {str(e)}")
+     print(f"Flakiness: Error loading all scores: {e!s}")
      return {}
```

```diff
  except Exception as e:
      print(
          "Flakiness: Error loading score for "
-         f"{repo}/{workflow_name}/{check_name}/{job_name}: {str(e)}"
+         f"{repo}/{workflow_name}/{check_name}/{job_name}: {e!s}"
      )
```

Also applies to: 431-434
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/database.py` around lines 414 - 415, Replace uses of str(e) in exception formatting with the f-string conversion flag {e!s} for cleaner syntax; specifically update the print/error messages that currently use print(f"Flakiness: Error loading all scores: {str(e)}") and the similar block around lines 431-434 to use print(f"Flakiness: Error loading all scores: {e!s}") (or the equivalent logging call) so all exception interpolations use {e!s} instead of str(e).migrations/0005_extend_flakiness_scores.sql (1)
41-44: Note: idx_flakiness_scores_lookup may duplicate the PRIMARY KEY index.

SQLite automatically creates an index for composite PRIMARY KEYs. The explicit idx_flakiness_scores_lookup on the same columns is likely redundant but harmless. Consider removing it to reduce index maintenance overhead, or keep it if you need a named index for explicit query hints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@migrations/0005_extend_flakiness_scores.sql` around lines 41 - 44, The explicit CREATE INDEX for idx_flakiness_scores_lookup duplicates the composite PRIMARY KEY index on flakiness_scores(repo, workflow_name, check_name, job_name); remove the CREATE INDEX IF NOT EXISTS idx_flakiness_scores_lookup ... statement from the migration to avoid redundant index maintenance (or if you intentionally need a named index for query hints, add a code comment explaining that decision instead of leaving the redundant index in place).

src/utils.py (1)
479-492: Consider: Aggregate count approach may be overly lenient.

The current logic counts all known-flaky checks for the repo and applies reduced penalty if checks_failed <= known_flaky_count. This could reduce penalties even when the actual failing checks aren't the known-flaky ones.

Since pr_data only carries aggregate counts (not individual check names), this is a pragmatic compromise. The comment at Lines 480-481 acknowledges this as a "conservative upper bound." If more precise matching becomes important later, the readiness endpoint could be enhanced to pass individual check identities.
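The clamp suggested by this review can be sketched as follows. The function and the `checks_failed` field name are assumptions taken from the review text, not the actual `src/utils.py` code:

```python
def effective_flaky_count(pr_data: dict, known_flaky_count: int) -> int:
    """Clamp the repo-wide flaky-check count to the PR's actual failing checks.

    Hypothetical helper illustrating the suggested fix; `checks_failed`
    mirrors the aggregate count described in the review comment.
    """
    checks_failed = pr_data.get('checks_failed', 0)
    # Never credit more "known flaky" failures than checks that actually failed.
    return min(known_flaky_count, checks_failed)

# Example: the repo has 5 known-flaky checks, but this PR only has 2 failures.
print(effective_flaky_count({'checks_failed': 2}, 5))  # → 2
```

With this clamp, a repo with many historical flaky checks cannot erase the penalty for a PR whose failures are fewer but unrelated.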
Verify each finding against the current code and only fix it if needed. In `@src/utils.py` around lines 479 - 492, The current aggregate matching can over-credit failures as "known flaky" because known_flaky_count may exceed the repo's failing checks; to avoid reducing penalties incorrectly, clamp known_flaky_count to the reported failing checks by using pr_data.get('checks_failed') when computing the effective flaky count (i.e., effective_flaky = min(known_flaky_count, pr_data.get('checks_failed', 0))); update the computation around known_flaky_count and add a short comment referencing flakiness_scores, pr_data, and checks_failed that explains this pragmatic clamp and TODO to switch to per-check identities when the readiness endpoint can provide them.

src/cache.py (1)
341-342: Minor: Use f-string conversion flag.

🔧 Proposed fix

      except Exception as e:
    -     print(f"Flakiness Cache: D1 load failed ({str(e)}), not caching")
    +     print(f"Flakiness Cache: D1 load failed ({e!s}), not caching")
          return {}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cache.py` around lines 341 - 342, In the except block that prints the D1 load failure message in src/cache.py (the handler that currently does: except Exception as e: print(f"Flakiness Cache: D1 load failed ({str(e)}), not caching")), replace the str(e) call with an f-string conversion flag so the exception is shown with its representation (use {e!r}) to preserve type and detail; keep the same print call and message context but use {e!r} instead of str(e).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@migrations/0005_extend_flakiness_scores.sql`:
- Around line 21-36: The INSERT currently INNER JOINs flakiness_scores to
ci_run_history which will drop any flakiness_scores rows with no matching
ci_run_history; change the JOIN to a LEFT JOIN and use a fallback for repo
(e.g., COALESCE(h.repo, s.repo)) so orphaned scores are preserved when inserting
into flakiness_scores_v2, updating the SELECT to pull repo from h when present
otherwise from s; alternatively, if dropping orphans is intentional add a
comment above the query explaining that behavior.
---
Duplicate comments:
In `@scripts/flakiness/analyze_flakiness.py`:
- Line 172: The print call using an f-string at print(f'[dry-run] →
insufficient data (need ≥5 runs)', file=sys.stderr) has no placeholders—remove
the unnecessary f prefix and make it a normal string literal in the same
print(...) call; update the expression in the analyze_flakiness.py print
statement accordingly so it reads print('[dry-run] → insufficient data (need
≥5 runs)', file=sys.stderr) without changing behavior.
---
Nitpick comments:
In `@migrations/0005_extend_flakiness_scores.sql`:
- Around line 41-44: The explicit CREATE INDEX for idx_flakiness_scores_lookup
duplicates the composite PRIMARY KEY index on flakiness_scores(repo,
workflow_name, check_name, job_name); remove the CREATE INDEX IF NOT EXISTS
idx_flakiness_scores_lookup ... statement from the migration to avoid redundant
index maintenance (or if you intentionally need a named index for query hints,
add a code comment explaining that decision instead of leaving the redundant
index in place).
In `@scripts/flakiness/report_flakiness.py`:
- Around line 278-279: The variables account_id, db_id, and token are only set
inside the if not args.dry_run branch which can make their scope unclear; to
fix, explicitly initialize account_id, db_id, token = None, None, None before
the if or move the d1-related logic (including the d1_select call) into the same
not args.dry_run block so get_d1_credentials() and d1_select are in the same
scope; update references to args.dry_run, get_d1_credentials, and d1_select
accordingly to ensure the variables are always defined when used.
In `@src/cache.py`:
- Around line 341-342: In the except block that prints the D1 load failure
message in src/cache.py (the handler that currently does: except Exception as e:
print(f"Flakiness Cache: D1 load failed ({str(e)}), not caching")), replace the
str(e) call with an f-string conversion flag so the exception is shown with its
representation (use {e!r}) to preserve type and detail; keep the same print call
and message context but use {e!r} instead of str(e).
In `@src/database.py`:
- Around line 414-415: Replace uses of str(e) in exception formatting with the
f-string conversion flag {e!s} for cleaner syntax; specifically update the
print/error messages that currently use print(f"Flakiness: Error loading all
scores: {str(e)}") and the similar block around lines 431-434 to use
print(f"Flakiness: Error loading all scores: {e!s}") (or the equivalent logging
call) so all exception interpolations use {e!s} instead of str(e).
In `@src/utils.py`:
- Around line 479-492: The current aggregate matching can over-credit failures
as "known flaky" because known_flaky_count may exceed the repo's failing checks;
to avoid reducing penalties incorrectly, clamp known_flaky_count to the reported
failing checks by using pr_data.get('checks_failed') when computing the
effective flaky count (i.e., effective_flaky = min(known_flaky_count,
pr_data.get('checks_failed', 0))); update the computation around
known_flaky_count and add a short comment referencing flakiness_scores, pr_data,
and checks_failed that explains this pragmatic clamp and TODO to switch to
per-check identities when the readiness endpoint can provide them.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: OWASP-BLT/coderabbit/.coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 034ff890-a354-4901-af94-bbf77a7571b0
📒 Files selected for processing (9)
- migrations/0004_create_flakiness_tables.sql
- migrations/0005_extend_flakiness_scores.sql
- scripts/flakiness/analyze_flakiness.py
- scripts/flakiness/flakiness_config.yml
- scripts/flakiness/report_flakiness.py
- scripts/flakiness/retry_failures.py
- src/cache.py
- src/database.py
- src/utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
- scripts/flakiness/flakiness_config.yml
- migrations/0004_create_flakiness_tables.sql
The system classifies a check as flaky only when the failure rate crosses the configured threshold and at least one intermittent flaky occurrence is detected.

Also, the system doesn't drop workflow_name; it is included in the distinct job combination, which consists of workflow_name, job_name, and check_name.
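That classification rule can be sketched as below. The threshold value and the pass/fail encoding are assumptions for illustration; the real logic lives in analyze_flakiness.py:

```python
def is_flaky(runs: list[bool], failure_rate_threshold: float = 0.1) -> bool:
    """Classify a check as flaky: failure rate over the threshold AND at
    least one intermittent occurrence (a failure followed later by a pass).

    `runs` holds True for pass, False for fail, oldest first; the 0.1
    threshold is illustrative, not the configured value.
    """
    if not runs:
        return False
    failure_rate = runs.count(False) / len(runs)
    # Intermittent: some failure has a later success, ruling out a hard break.
    intermittent = any(
        not passed and any(runs[i + 1:])
        for i, passed in enumerate(runs)
    )
    return failure_rate > failure_rate_threshold and intermittent

# A check that fails then recovers is flaky; one that stays broken is not.
print(is_flaky([True, False, True, True, False, True]))    # → True
print(is_flaky([True, True, False, False, False, False]))  # → False
```

The second example shows why the intermittency condition matters: a permanently broken check has a high failure rate but is a genuine failure, not flakiness.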
…it directly for a more optimal approach
I made a few optimizations to simplify the flakiness schema and lookup flow.

- Schema cleanup: flakiness_scores now uses (repo, workflow_name, check_name, job_name) as the primary key in 0004_create_flakiness_tables.sql, with repo as the leading column. The indexes previously introduced in 0005 were also moved into this migration.
- Removed redundant migration: Since the schema now includes repo from the start, 0005_extend_flakiness_scores.sql (which previously added the column via a create-copy-rename process) is no longer needed and has been removed.
- Simplified flakiness lookup: The in-memory 60-minute TTL cache layer was removed. handlers.py now directly calls get_all_flakiness_scores() instead of using get_cached_flakiness_scores() from cache.py. The behavior remains the same because get_all_flakiness_scores() already returns {} safely on failure.

Overall this reduces migration complexity and removes an unnecessary caching layer while keeping the runtime behavior unchanged.
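The described schema can be sketched with sqlite3 for local experimentation (the score column and sample values are assumptions; the real DDL targets Cloudflare D1 in 0004_create_flakiness_tables.sql):

```python
import sqlite3

# Composite primary key with repo as the leading column, as described above.
# The `score` column is illustrative only.
DDL = """
CREATE TABLE IF NOT EXISTS flakiness_scores (
    repo          TEXT NOT NULL,
    workflow_name TEXT NOT NULL,
    check_name    TEXT NOT NULL,
    job_name      TEXT NOT NULL,
    score         REAL NOT NULL DEFAULT 0.0,
    PRIMARY KEY (repo, workflow_name, check_name, job_name)
);
"""

conn = sqlite3.connect(':memory:')
conn.execute(DDL)
conn.execute(
    "INSERT INTO flakiness_scores VALUES (?, ?, ?, ?, ?)",
    ('OWASP-BLT/Leaf', 'ci', 'tests', 'unit', 0.35),
)
# The composite PK already serves repo-leading lookups; no extra index needed.
row = conn.execute(
    "SELECT score FROM flakiness_scores WHERE repo = ?",
    ('OWASP-BLT/Leaf',),
).fetchone()
print(row[0])  # → 0.35
```

Because repo leads the primary key, a per-repo lookup like the one above can use the implicit PK index directly, which is what makes the separate lookup index from 0005 unnecessary.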
🎓 A mentor has been assigned to this issue! Mentor: @dev-sanidhya. @dev-sanidhya — please provide guidance and support. @ojaswa072 — @dev-sanidhya will help you through this. Feel free to ask questions here. Happy coding! 🚀 — OWASP BLT-Pool
🧹 Nitpick comments (1)
src/handlers.py (1)
1587-1588: Reuse one DB handle inside _run_readiness_analysis.

This works, but opening another DB handle at Line 1588 is avoidable. Reusing one local db instance keeps the function tighter and reduces per-call overhead.

♻️ Proposed refactor

      async def _run_readiness_analysis(env, pr, pr_id, github_token):
    @@
    -     # Calculate and update review_status from timeline data
    +     db = get_db(env)
    +
    +     # Calculate and update review_status from timeline data
          # This ensures the database has the latest review status without making duplicate API calls
          original_review_status = pr.get('review_status', 'pending')
          review_status = calculate_review_status(timeline_data.get('reviews', []))
          if review_status != original_review_status:
              # Update review_status in database only if it actually changed
    -         db = get_db(env)
              await db.prepare(
                  'UPDATE prs SET review_status = ?, updated_at = CURRENT_TIMESTAMP WHERE id = ?'
              ).bind(review_status, pr_id).run()
              pr['review_status'] = review_status
    @@
    -     flakiness_scores = await get_all_flakiness_scores(get_db(env))
    +     flakiness_scores = await get_all_flakiness_scores(db)

Also applies to: 1591-1594
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/handlers.py` around lines 1587 - 1588, In _run_readiness_analysis, avoid opening multiple DB handles by calling get_db(env) once into a local variable (e.g., db) and reuse that handle for subsequent calls such as get_all_flakiness_scores(db), get_latest_member_metrics(db), and get_member_merged_prs_by_repo(db) (and any other DB-using helpers in the same function), replacing the extra get_db(env) invocations so the function reuses a single DB instance for all operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/handlers.py`:
- Around line 1587-1588: In _run_readiness_analysis, avoid opening multiple DB
handles by calling get_db(env) once into a local variable (e.g., db) and reuse
that handle for subsequent calls such as get_all_flakiness_scores(db),
get_latest_member_metrics(db), and get_member_merged_prs_by_repo(db) (and any
other DB-using helpers in the same function), replacing the extra get_db(env)
invocations so the function reuses a single DB instance for all operations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: OWASP-BLT/coderabbit/.coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 90ce6247-fe73-4264-94cb-7b5c3ac17ef8
📒 Files selected for processing (2)
- migrations/0004_create_flakiness_tables.sql
- src/handlers.py
🚧 Files skipped from review as they are similar to previous changes (1)
- migrations/0004_create_flakiness_tables.sql
This PR introduces an end-to-end CI flakiness detection system for BLT-Leaf that identifies nondeterministic CI failures and integrates flakiness scores into the PR readiness calculation.
Problem
Currently, all CI failures are treated as deterministic failures. In practice, many failures are caused by transient infrastructure issues or flaky tests, which can unfairly penalize contributors and distort readiness scoring.
Solution
This PR adds a pipeline that:
Collects CI job results
Retries failed jobs to confirm flakes
Computes statistical flakiness scores over a 20-run window
Generates reports and integrates flakiness scores into PR readiness scoring
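The scoring stage of the pipeline above can be sketched as a single helper. The 20-run window matches the description; the transition-ratio formula is an assumption, since the actual computation lives in analyze_flakiness.py:

```python
from collections import deque

WINDOW = 20  # rolling window of most recent runs, per the PR description


def flakiness_score(history: list[bool]) -> float:
    """Score a check by pass/fail transitions within the last WINDOW runs.

    `history` is oldest-first booleans (True = pass). A score of 0 means a
    stable check; values near 1 mean results alternate almost every run.
    This transition ratio is an illustrative metric, not the PR's exact one.
    """
    window = deque(history, maxlen=WINDOW)  # keep only the last 20 results
    if len(window) < 2:
        return 0.0
    runs = list(window)
    transitions = sum(a != b for a, b in zip(runs, runs[1:]))
    return transitions / (len(runs) - 1)


print(flakiness_score([True] * 20))         # → 0.0 (stable check)
print(flakiness_score([True, False] * 10))  # → 1.0 (maximally flaky)
```

Counting transitions rather than raw failures is one way to separate intermittent checks from consistently broken ones: a check that fails every run has a high failure rate but zero transitions.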
Key Changes
CI pipeline workflow (flakiness_detector.yml)
D1 schema migration for CI history
Flakiness detection scripts:
collect_ci_results.py
retry_failures.py
analyze_flakiness.py
report_flakiness.py
Worker integration to reduce penalties for known flaky checks
Testing Evidence
103 unit tests passing
Full pipeline verified locally using --dry-run
Retry-based flake detection validated
Flakiness scoring tested on a 20-run window
Worker readiness scoring integration demonstrated
Example behavior:
BEFORE: CI penalty = 0.50 → PR readiness = 61
AFTER: CI penalty = 0.20 → PR readiness = 63
This ensures contributors are not unfairly penalized for intermittent CI issues.
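The before/after numbers above are consistent with a penalty-scaling step roughly like the sketch below. The failure counts and the proportional formula are assumptions chosen to reproduce the example's direction, not the worker's actual code:

```python
def ci_penalty(base_penalty: float, failed: int, known_flaky: int) -> float:
    """Scale the CI penalty by the share of failures that are NOT known flaky.

    Illustrative only: with base_penalty 0.50, 5 failing checks and 3 of
    them known flaky, the penalty drops to 0.50 * 2/5 = 0.20, matching the
    direction of the example above.
    """
    if failed <= 0:
        return 0.0
    hard_failures = max(failed - known_flaky, 0)
    return base_penalty * hard_failures / failed

print(ci_penalty(0.50, 5, 3))  # → 0.2
print(ci_penalty(0.50, 5, 0))  # → 0.5
```

Under a scheme like this, flaky failures still carry some signal (the penalty only shrinks, it never flips to a bonus), which keeps readiness scores conservative.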
Artifacts generated:
flakiness_report.md
flakiness_metrics.json
Summary by CodeRabbit
Release Notes
New Features
Chores