
Flakiness detection init #281

Open
Rudra-rps wants to merge 9 commits into OWASP-BLT:main from Rudra-rps:feat/Flakiness_det

Conversation

@Rudra-rps
Contributor

@Rudra-rps Rudra-rps commented Mar 11, 2026

This PR introduces an end-to-end CI flakiness detection system for BLT-Leaf that identifies nondeterministic CI failures and integrates flakiness scores into the PR readiness calculation.

Problem

Currently, all CI failures are treated as deterministic failures. In practice, many failures are caused by transient infrastructure issues or flaky tests, which can unfairly penalize contributors and distort readiness scoring.

Solution

This PR adds a pipeline that:

Collects CI job results

Retries failed jobs to confirm flakes

Computes statistical flakiness scores over a 20-run window

Generates reports and integrates flakiness scores into PR readiness scoring
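The statistical scoring step could look roughly like the sketch below. This is a minimal illustration only: the function names (`flakiness_score`, `classify`), the `status` field values, and the threshold defaults are assumptions, not the verified contents of analyze_flakiness.py.

```python
# Hypothetical sketch of window-based flakiness scoring.
# Field names and thresholds are assumptions; the real
# analyze_flakiness.py may differ.

def flakiness_score(window_rows, window_size=20):
    """Failure rate over the most recent `window_size` runs.

    Each row is a dict with a 'status' key:
    'pass', 'fail', or 'flake_confirmed'.
    """
    recent = window_rows[-window_size:]
    if not recent:
        return 0.0
    failures = sum(1 for r in recent if r['status'] != 'pass')
    return failures / len(recent)

def classify(window_rows, flaky_min=0.05, flaky_max=0.8):
    """Classify a check as 'stable', 'flaky', or 'deterministic'."""
    rate = flakiness_score(window_rows)
    flake_confirmed = sum(
        1 for r in window_rows if r['status'] == 'flake_confirmed'
    )
    if rate == 0.0:
        return 'stable'
    if rate > flaky_max:
        # Fails on almost every run: a real, deterministic failure.
        return 'deterministic'
    if rate >= flaky_min and flake_confirmed > 0:
        # Intermittent, and at least one retry confirmed a flake.
        return 'flaky'
    return 'stable'
```

A check that failed once in 20 runs, with that failure confirmed as a flake by a retry, would score 0.05 and be classified `'flaky'` under this sketch.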

Key Changes

CI pipeline workflow (flakiness_detector.yml)

D1 schema migration for CI history

Flakiness detection scripts:

collect_ci_results.py

retry_failures.py

analyze_flakiness.py

report_flakiness.py

Worker integration to reduce penalties for known flaky checks

Testing Evidence

103 unit tests passing

Full pipeline verified locally using --dry-run

Retry-based flake detection validated

Flakiness scoring tested on a 20-run window

Worker readiness scoring integration demonstrated

Example behavior:

BEFORE: CI penalty = 0.50 → PR readiness = 61
AFTER: CI penalty = 0.20 → PR readiness = 63

This ensures contributors are not unfairly penalized for intermittent CI issues.
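The before/after numbers follow from a reduced-penalty rule along these lines. This is a hedged sketch: the constant names and the aggregate-count condition are assumptions about what src/utils.py does, not a verified copy of it.

```python
# Hypothetical sketch of the reduced CI penalty for known flaky checks.
# Names and the exact condition are assumptions; see src/utils.py for
# the actual implementation.

FULL_PENALTY = 0.50   # deterministic CI failure
FLAKY_PENALTY = 0.20  # failure attributed to a known flaky check

def ci_penalty(checks_failed, known_flaky_count):
    """Return the CI penalty applied to readiness scoring.

    pr_data only carries aggregate counts, not check names, so
    `checks_failed <= known_flaky_count` is a conservative upper bound:
    every failure *could* be one of the known flaky checks.
    """
    if checks_failed == 0:
        return 0.0
    if checks_failed <= known_flaky_count:
        return FLAKY_PENALTY
    return FULL_PENALTY
```

With one failure and at least one known flaky check, the penalty drops from 0.50 to 0.20, which matches the readiness change (61 → 63) shown above.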

Artifacts generated:

flakiness_report.md

flakiness_metrics.json

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automated flakiness detection that identifies unreliable tests and retries failed jobs to confirm flakiness.
    • Flaky tests are now reported as GitHub Issues and PR comments for visibility.
    • PR readiness scoring now accounts for known flaky tests, reducing CI penalties for failures attributed to test instability.
  • Chores

    • Added GitHub Actions workflow for flakiness detection pipeline.
    • Created database schema for CI run history and flakiness metrics.
    • Added configuration and test suite for flakiness detection system.

@owasp-blt

owasp-blt bot commented Mar 11, 2026

📊 Monthly Leaderboard

Hi @Rudra-rps! Here's how you rank for March 2026:

| Rank | User | Open PRs | PRs (merged) | PRs (closed) | Reviews | Comments | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| #23 | @SatishKumar620 | 9 | 0 | 0 | 0 | 23 | 55 |
| #24 | @Rudra-rps | 2 | 4 | 1 | 1 | 4 | 53 |
| #25 | @Riya-Jain-here | 3 | 2 | 0 | 0 | 10 | 43 |

Scoring this month (across OWASP-BLT org): Open PRs (+1 each), Merged PRs (+10), Closed (not merged) (−2), Reviews (+5; first two per PR in-month), Comments (+2, excludes CodeRabbit). Run /leaderboard on any issue or PR to see your rank!

@owasp-blt

owasp-blt bot commented Mar 11, 2026

👋 Hi @Rudra-rps!

This pull request needs a peer review before it can be merged. Please request a review from a team member who is not:

  • The PR author
  • coderabbitai
  • copilot

Once a valid peer review is submitted, this check will pass automatically. Thank you!

⚠️ Peer review enforcement is active.

@github-actions

🍃 PR Readiness Check

Check the readiness of this PR on Leaf:
👉 Open on Leaf

Leaf reviews pull requests for operational readiness, security risks, and production-impacting changes before they ship.

@github-actions github-actions bot added the files-changed: 9 PR changes 9 files label Mar 11, 2026
@coderabbitai

coderabbitai bot commented Mar 11, 2026

Walkthrough

Introduces a complete flakiness detection pipeline consisting of a GitHub Actions workflow that orchestrates CI result collection, failure analysis, job retries, and flakiness reporting. Includes database schema for storing run history and computed flakiness scores, Python scripts implementing the pipeline stages, comprehensive tests, and integration with the existing PR readiness system to factor flaky failures into CI confidence scoring.

Changes

Cohort / File(s) Summary
GitHub Actions Orchestration
.github/workflows/flakiness_detector.yml
New workflow automating the flakiness detection pipeline with metadata resolution, CI result collection, conditional retries, analysis, GitHub reporting, and artifact storage.
Database Schema & Migrations
migrations/0004_create_flakiness_tables.sql
Creates ci_run_history table for logging per-run CI job executions, flakiness_scores for computed metrics per job, and seeds known_infrastructure_issues patterns for classifying infrastructure-related failures.
Flakiness Pipeline Scripts
scripts/flakiness/collect_ci_results.py, scripts/flakiness/analyze_flakiness.py, scripts/flakiness/retry_failures.py, scripts/flakiness/report_flakiness.py, scripts/flakiness/db_utils.py, scripts/flakiness/flakiness_config.yml
Modular scripts implementing collection of CI results from GitHub, statistical analysis with multi-step classification logic (deterministic/flaky/stable), orchestrated retries via GitHub API, flakiness reporting to issues/PRs with file generation, and D1 REST API client utilities. Configuration defines thresholds for analysis window, flaky/deterministic failure-rate bands, severity levels, and GitHub behavior.
Flakiness Tests
scripts/flakiness/tests/test_*.py
Comprehensive test suite covering classify_conclusion logic, severity mapping, multi-step flakiness classification, GitHub API interactions, D1 queries, database utilities, issue/PR commenting, and markdown report generation; includes dry-run and integration scenarios.
PR Readiness Integration
src/database.py
New async helpers to retrieve all flakiness scores and lookup specific (repo, workflow, check, job) entries from the database.
Readiness Calculation
src/handlers.py, src/utils.py
Handler loads flakiness scores and passes them to PR readiness calculation. Readiness logic extended to compute known_flaky_count and apply reduced penalties (0.20 vs 0.50) for known flaky failures; flaky failures <= known count are converted to non-blocking warnings instead of blockers; CI confidence scoring accounts for flaky job count.
Configuration & Metadata
package.json, .gitignore
Added test:flakiness npm script. Gitignore extended to exclude flakiness report artifacts (data/flakiness_report.md, data/flakiness_metrics.json).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

quality: high

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.57%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Flakiness detection init' directly and clearly describes the main objective of this PR: initializing a CI flakiness detection system with workflows, database schema, analysis scripts, and worker integration. |




@github-actions github-actions bot added files-changed: 12 PR changes 12 files and removed files-changed: 9 PR changes 9 files labels Mar 11, 2026
@github-actions github-actions bot added files-changed: 19 PR changes 19 files and removed files-changed: 12 PR changes 12 files labels Mar 12, 2026
@Rudra-rps Rudra-rps marked this pull request as ready for review March 12, 2026 18:28
@github-actions github-actions bot added files-changed: 20 PR changes 20 files and removed files-changed: 19 PR changes 19 files labels Mar 12, 2026
@S3DFX-CYBER
Copy link
Copy Markdown
Contributor

@coderabbitai are all changes in scope or required?

@coderabbitai

coderabbitai bot commented Mar 12, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get list of files changed in PR `#281`
gh pr view 281 --repo OWASP-BLT/BLT-Leaf --json files --jq '.files[].path'

Length of output: 662


🏁 Script executed:

#!/bin/bash
# Get the diff for existing modified files
gh pr diff 281 --repo OWASP-BLT/BLT-Leaf -- src/cache.py src/database.py src/handlers.py src/utils.py package.json .gitignore

Length of output: 36


🏁 Script executed:

#!/bin/bash
# Get full PR diff
gh pr diff 281 --repo OWASP-BLT/BLT-Leaf > /tmp/pr281.diff 2>&1

echo "=== src/cache.py diff ==="
grep -A 60 'diff --git a/src/cache.py' /tmp/pr281.diff | head -80

echo "=== src/database.py diff ==="
grep -A 60 'diff --git a/src/database.py' /tmp/pr281.diff | head -80

echo "=== src/handlers.py diff ==="
grep -A 80 'diff --git a/src/handlers.py' /tmp/pr281.diff | head -100

echo "=== src/utils.py diff ==="
grep -A 60 'diff --git a/src/utils.py' /tmp/pr281.diff | head -80

echo "=== package.json diff ==="
grep -A 30 'diff --git a/package.json' /tmp/pr281.diff | head -40

echo "=== .gitignore diff ==="
grep -A 20 'diff --git a/.gitignore' /tmp/pr281.diff | head -30

Length of output: 12151


@S3DFX-CYBER: Yes, all 20 changed files appear to be in-scope and required for the stated PR objective. Here's the breakdown:

| Category | Files / Change | Verdict |
| --- | --- | --- |
| CI Workflow | .github/workflows/flakiness_detector.yml | ✅ Core pipeline |
| DB Migration | migrations/0004_create_flakiness_tables.sql | ✅ Required schema for new tables (ci_run_history, flakiness_scores) |
| Scripts | scripts/flakiness/collect_ci_results.py, analyze_flakiness.py, retry_failures.py, report_flakiness.py, db_utils.py, flakiness_config.yml | ✅ Core detection pipeline |
| Tests | scripts/flakiness/tests/test_*.py + __init__.py | ✅ 103 unit tests for all scripts |
| Worker: cache.py | Adds get_cached_flakiness_scores() with 60-min in-memory TTL | ✅ Required to avoid repeated D1 lookups per request |
| Worker: database.py | Adds get_all_flakiness_scores() and get_flakiness_score() | ✅ Required to read from new D1 tables |
| Worker: handlers.py | Loads flakiness scores and passes them to calculate_pr_readiness() | ✅ Core worker integration |
| Worker: utils.py | Modifies calculate_ci_confidence() to apply reduced penalty (0.50 → 0.20) for known-flaky failures | ✅ The primary scoring change described in the PR |
| package.json | Adds test:flakiness npm script | ✅ Appropriate convenience script |
| .gitignore | Excludes data/flakiness_report.md and data/flakiness_metrics.json | ✅ Correct; these are generated artifacts |

No file appears to introduce unrelated concerns or unnecessary scope creep. The src/ changes are tightly coupled to the flakiness feature and wouldn't function without the scripts and migration. The scoring change in src/utils.py (reduced CI penalty for flaky checks) is exactly what the PR advertises.

@ojaswa072
Contributor

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Mar 12, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (5)
scripts/flakiness/tests/test_report_flakiness.py (1)

77-77: Minor typo in test method name.

The method name test_contains_nextg_steps_section appears to have a typo - should likely be test_contains_next_steps_section.

💅 Suggested fix
-    def test_contains_nextg_steps_section(self):
+    def test_contains_next_steps_section(self):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/tests/test_report_flakiness.py` at line 77, Rename the test
function named test_contains_nextg_steps_section to
test_contains_next_steps_section in the test_report_flakiness.py file; update
the function declaration and any internal or external references (e.g., other
tests, fixtures, or test-ids) that refer to test_contains_nextg_steps_section so
test discovery and references remain consistent, ensuring no other code relies
on the old name.
scripts/flakiness/collect_ci_results.py (1)

146-154: Simplify the single-element tuple check.

Line 148 uses if conclusion_category in ('pass',): which is valid but unnecessarily verbose for a single value. Consider simplifying to a direct equality check.

♻️ Suggested simplification
-        if conclusion_category in ('pass',):
+        if conclusion_category == 'pass':
             status = 'pass'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/collect_ci_results.py` around lines 146 - 154, The check
using a single-element tuple is overly verbose: in the loop that calls
classify_conclusion (the block iterating "for job in jobs"), replace the
membership test "if conclusion_category in ('pass',):" with a direct equality
comparison "if conclusion_category == 'pass'" to simplify the logic and improve
readability; keep the subsequent elif (conclusion_category == 'skip') and else
branches unchanged.
src/utils.py (1)

439-450: Clarify the reduced penalty logic for edge cases.

The condition checks_failed <= known_flaky_count applies the reduced penalty even when some failures might be from non-flaky checks. For example, if there are 2 known flaky checks in the repo and 2 current failures, the reduced penalty applies even if both failures are from completely different checks.

This is acknowledged in the comment at Lines 479-481 ("we cannot match by check name here because pr_data only carries aggregate counts"), making this a conservative upper-bound approach. Consider adding a brief inline comment here explaining this design trade-off for future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils.py` around lines 439 - 450, Add an inline comment near the
reduced-penalty logic (the block using known_flaky_count, checks_failed, and
fail_penalty) explaining the design trade-off: because pr_data only provides
aggregate counts (not check names), the condition checks_failed <=
known_flaky_count is a conservative upper-bound that may wrongly apply the
reduced 0.20 penalty when failing checks are different from the known flaky
ones; explicitly document this limitation and that this choice intentionally
favors weaker penalization in ambiguous cases.
scripts/flakiness/tests/test_retry_failures.py (1)

289-318: Consider asserting on mock_mark for test completeness.

The static analysis correctly identifies that mock_mark is passed but never used. Since job-a is classified as confirmed_flake, mark_flake_confirmed should be called once. Adding an assertion would strengthen this test.

💚 Suggested assertion addition
         self.assertEqual(output['job-a'], 'confirmed_flake')
         self.assertEqual(output['job-b'], 'real_failure')
+        # Verify mark_flake_confirmed was called only for the flaky job
+        mock_mark.assert_called_once()
+        self.assertEqual(mock_mark.call_args[0][4], 'job-a')  # job_name argument
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/tests/test_retry_failures.py` around lines 289 - 318, The
test test_multiple_jobs_classified_independently currently patches
mark_flake_confirmed as mock_mark but never asserts it was called; add an
assertion after the existing output checks to verify mark_flake_confirmed was
invoked once for the confirmed flake (e.g., call mock_mark.assert_called_once()
or mock_mark.assert_called_once_with(...) if you know the expected args) so the
test verifies that retry_failures.mark_flake_confirmed was actually called when
job-a is classified as 'confirmed_flake'.
scripts/flakiness/report_flakiness.py (1)

331-344: Use the parsed report as the artifact source of truth.

report is already loaded from --flaky-report/stdin above, but this branch rebuilds all_scores from D1. That can make flakiness_report.md and flakiness_metrics.json diverge from the exact analysis results the script just processed. If you still need DB-only fields like last_updated, merge them in after starting from report.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/report_flakiness.py` around lines 331 - 344, The dry-run
branch replaces the parsed input `report` with fresh DB rows by calling
`d1_select`, which causes the generated artifacts to diverge from the exact
analysis the script just processed; instead, keep `all_scores` sourced from the
already-loaded `report` (use `report.get('flaky', []) +
report.get('deterministic', []) + report.get('stable', [])`) and, if you still
need DB-only fields such as `last_updated`, merge those fields into the items
from `report` by looking them up via `d1_select` results (e.g., keyed by test
id) and copying only the missing DB fields into the `report` entries before
writing `flakiness_report.md` and `flakiness_metrics.json`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@migrations/0004_create_flakiness_tables.sql`:
- Around line 40-57: The migration seed for the known_infrastructure_issues
table is missing two patterns that are present in flakiness_config.yml; update
the INSERT INTO known_infrastructure_issues block in
migrations/0004_create_flakiness_tables.sql to include entries for the
"dependency" and "upstream" patterns (with category 'infrastructure' and
appropriate brief descriptions) so the database seeding matches the config;
ensure the new rows use INSERT OR IGNORE semantics and follow the same schema
(pattern, category, description) as the existing entries.

In `@scripts/flakiness/analyze_flakiness.py`:
- Around line 200-250: Selection, history lookup and UPSERT must use the same
identity: include workflow_name everywhere; update the history d1_select call to
filter by workflow_name (add AND workflow_name = ? and pass workflow_name in the
params used in the call) and change the UPSERT conflict target in d1_query from
ON CONFLICT(check_name, job_name) to ON CONFLICT(check_name, job_name,
workflow_name) so rows are per (check_name, job_name, workflow_name); touch the
related variables check_name, job_name, workflow_name and the functions
d1_select, analyze_check, d1_query in the loop to ensure the same tuple is used
for selection, analysis and the INSERT/UPDATE.
- Around line 71-102: The classification currently ignores flake_confirmed
counts; change the classification logic so that the 'flaky' branch requires both
failure_rate >= flaky_min AND flaky_count > 0 (i.e., at least one
'flake_confirmed' in window_rows). Concretely, update the conditional sequence
around variables consecutive, failure_rate, flaky_count, consec_det, flaky_max,
flaky_min so that deterministic branches remain the same but the flaky branch
becomes "elif failure_rate >= flaky_min and flaky_count > 0: classification =
'flaky'"; keep flakiness_score and severity computation using _get_severity only
for classification == 'flaky'.

In `@scripts/flakiness/report_flakiness.py`:
- Around line 55-83: The issue is that titles and lookups collapse identity to
check_name only; update _issue_title (and any callers like search_flaky_issue
and create_issue) to include job_name and workflow_name along with check_name
and prefix (e.g., "{prefix} {workflow_name} / {job_name} / {check_name}") so the
generated title is unique per job+workflow+check, and ensure the PR
summary/table generation uses the same composed identity (replace references to
only check_name with the trio job_name, workflow_name, check_name) so
search_flaky_issue finds/compares using the exact same title format.

In `@scripts/flakiness/retry_failures.py`:
- Line 164: The print statement in retry_failures.py currently uses an
unnecessary f-string: change the line printing "[retry] Simulated retry result:
success" to use a normal string (remove the leading `f`) so it becomes a regular
print call; locate the print in the retry/logging area (the print statement
shown as `print(f'[retry] Simulated retry result: success', file=sys.stderr)`)
and remove the `f` prefix.

---

Nitpick comments:
In `@scripts/flakiness/collect_ci_results.py`:
- Around line 146-154: The check using a single-element tuple is overly verbose:
in the loop that calls classify_conclusion (the block iterating "for job in
jobs"), replace the membership test "if conclusion_category in ('pass',):" with
a direct equality comparison "if conclusion_category == 'pass'" to simplify the
logic and improve readability; keep the subsequent elif (conclusion_category ==
'skip') and else branches unchanged.

In `@scripts/flakiness/report_flakiness.py`:
- Around line 331-344: The dry-run branch replaces the parsed input `report`
with fresh DB rows by calling `d1_select`, which causes the generated artifacts
to diverge from the exact analysis the script just processed; instead, keep
`all_scores` sourced from the already-loaded `report` (use `report.get('flaky',
[]) + report.get('deterministic', []) + report.get('stable', [])`) and, if you
still need DB-only fields such as `last_updated`, merge those fields into the
items from `report` by looking them up via `d1_select` results (e.g., keyed by
test id) and copying only the missing DB fields into the `report` entries before
writing `flakiness_report.md` and `flakiness_metrics.json`.

In `@scripts/flakiness/tests/test_report_flakiness.py`:
- Line 77: Rename the test function named test_contains_nextg_steps_section to
test_contains_next_steps_section in the test_report_flakiness.py file; update
the function declaration and any internal or external references (e.g., other
tests, fixtures, or test-ids) that refer to test_contains_nextg_steps_section so
test discovery and references remain consistent, ensuring no other code relies
on the old name.

In `@scripts/flakiness/tests/test_retry_failures.py`:
- Around line 289-318: The test test_multiple_jobs_classified_independently
currently patches mark_flake_confirmed as mock_mark but never asserts it was
called; add an assertion after the existing output checks to verify
mark_flake_confirmed was invoked once for the confirmed flake (e.g., call
mock_mark.assert_called_once() or mock_mark.assert_called_once_with(...) if you
know the expected args) so the test verifies that
retry_failures.mark_flake_confirmed was actually called when job-a is classified
as 'confirmed_flake'.

In `@src/utils.py`:
- Around line 439-450: Add an inline comment near the reduced-penalty logic (the
block using known_flaky_count, checks_failed, and fail_penalty) explaining the
design trade-off: because pr_data only provides aggregate counts (not check
names), the condition checks_failed <= known_flaky_count is a conservative
upper-bound that may wrongly apply the reduced 0.20 penalty when failing checks
are different from the known flaky ones; explicitly document this limitation and
that this choice intentionally favors weaker penalization in ambiguous cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: OWASP-BLT/coderabbit/.coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f37f3649-2296-49b3-8fde-998b8758ebe4

📥 Commits

Reviewing files that changed from the base of the PR and between a22a666 and 71784d7.

📒 Files selected for processing (19)
  • .github/workflows/flakiness_detector.yml
  • migrations/0004_create_flakiness_tables.sql
  • package.json
  • scripts/flakiness/analyze_flakiness.py
  • scripts/flakiness/collect_ci_results.py
  • scripts/flakiness/db_utils.py
  • scripts/flakiness/flakiness_config.yml
  • scripts/flakiness/report_flakiness.py
  • scripts/flakiness/retry_failures.py
  • scripts/flakiness/tests/__init__.py
  • scripts/flakiness/tests/test_analyze_flakiness.py
  • scripts/flakiness/tests/test_collect_ci_results.py
  • scripts/flakiness/tests/test_db_utils.py
  • scripts/flakiness/tests/test_report_flakiness.py
  • scripts/flakiness/tests/test_retry_failures.py
  • src/cache.py
  • src/database.py
  • src/handlers.py
  • src/utils.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (5)
scripts/flakiness/tests/test_analyze_flakiness.py (1)

267-271: Strengthen this upsert assertion against schema drift.

This only proves that an upsert exists. It will still pass if workflow_name or any of the metric columns disappear from the generated SQL. Assert the full column set expected by migrations/0004_create_flakiness_tables.sql.

Suggested test hardening
         sql = mock_query.call_args[0][3]
         self.assertIn('INSERT INTO flakiness_scores', sql)
         self.assertIn('ON CONFLICT', sql)
+        for column in (
+            'check_name',
+            'job_name',
+            'workflow_name',
+            'flakiness_score',
+            'severity',
+            'classification',
+            'total_runs',
+            'failure_count',
+            'flaky_failures',
+            'consecutive_failures',
+            'last_updated',
+        ):
+            self.assertIn(column, sql)
scripts/flakiness/tests/test_report_flakiness.py (1)

77-79: Typo in test method name: test_contains_nextg_steps_section

The method name contains a typo — "nextg" should be "next". While this doesn't affect test execution, it impacts discoverability and readability.

✏️ Proposed fix
-    def test_contains_nextg_steps_section(self):
+    def test_contains_next_steps_section(self):
         body = _build_issue_body(_entry())
         self.assertIn('Next steps', body)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/tests/test_report_flakiness.py` around lines 77 - 79,
Rename the test method test_contains_nextg_steps_section to
test_contains_next_steps_section to fix the typo; update the test function
definition where it calls _build_issue_body(_entry()) and asserts 'Next steps'
is in body so the test name accurately reflects its intent and improves
readability/discoverability.
scripts/flakiness/db_utils.py (1)

83-89: Consider defensive handling for missing 'pattern' key.

If a known_infrastructure_issues row somehow lacks a pattern column, line 89 would raise KeyError. While unlikely given the schema, a defensive .get() with a filter could improve robustness.

🛡️ Defensive approach
 def get_infra_patterns(account_id, db_id, token):
     """Return a list of lowercase infrastructure pattern strings from D1."""
     rows = d1_select(
         account_id, db_id, token,
         'SELECT pattern FROM known_infrastructure_issues',
     )
-    return [row['pattern'].lower() for row in rows]
+    return [row['pattern'].lower() for row in rows if row.get('pattern')]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/db_utils.py` around lines 83 - 89, get_infra_patterns
currently assumes every row has a 'pattern' key and will KeyError if not; update
the return to defensively extract the value (use row.get('pattern')), filter out
None/non-string values, and only call .lower() on legitimate strings so missing
or malformed rows are skipped (modify the list comprehension in
get_infra_patterns accordingly).
scripts/flakiness/tests/test_retry_failures.py (1)

289-317: Unused mock_mark should verify flake confirmation for multi-job scenario.

The test patches mark_flake_confirmed as mock_mark but doesn't assert on it. Since job-a is classified as confirmed_flake, the test should verify mark_flake_confirmed was called for job-a but not for job-b.

✅ Proposed fix to add assertion
         self.assertEqual(output['job-a'], 'confirmed_flake')
         self.assertEqual(output['job-b'], 'real_failure')
+        # Verify mark_flake_confirmed was called only for the confirmed flake
+        mock_mark.assert_called_once()
+        call_args = mock_mark.call_args[0]
+        self.assertEqual(call_args[4], 'job-a')  # job_name argument
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/tests/test_retry_failures.py` around lines 289 - 317, The
test_multiple_jobs_classified_independently test patches mark_flake_confirmed as
mock_mark but never asserts it was called; update the test to assert mock_mark
was called once for 'job-a' and not called for 'job-b' after
retry_failures.main() runs by using mock_mark.assert_any_call(...) or equivalent
and mock_mark.assert_not_called()/assert_called_once_with for the appropriate
arguments matching how mark_flake_confirmed is invoked in the code; ensure you
reference the patched mock_mark and the job identifiers 'job-a' and 'job-b' when
adding the assertions.
scripts/flakiness/collect_ci_results.py (1)

98-99: Consider logging unknown conclusions rather than silently treating them as pass.

Unknown conclusions (e.g., 'action_required', 'stale') are treated as 'pass', which may hide unexpected job states. Consider logging these for observability.

🔍 Proposed enhancement
     # unknown conclusions (e.g. 'action_required') → treat as pass
+    if conclusion not in ('', 'success', 'failure', 'skipped', 'cancelled', 'neutral', 'timed_out'):
+        import sys
+        print(f'[warn] Unknown conclusion "{conclusion}" for job "{job.get("name")}" — treating as pass',
+              file=sys.stderr)
     return 'pass'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/collect_ci_results.py` around lines 98 - 99, The code
currently treats any unknown conclusion by silently returning 'pass' (the line
with return 'pass'); update this to log the unexpected conclusion value before
returning so it's visible in CI logs — e.g., use the module's logger
(logger.warning or logging.warn) or print a warning that includes the conclusion
variable (e.g., f"Unexpected conclusion: {conclusion} -> treating as 'pass'")
and then return 'pass'; place the log immediately before the existing return so
behavior is unchanged but observable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/flakiness/analyze_flakiness.py`:
- Line 172: The print call uses an unnecessary f-string prefix in the message
printed to sys.stderr (the line containing print(f'[dry-run]   → insufficient
data (need ≥5 runs)', file=sys.stderr)); remove the leading "f" so the literal
string is printed (change to print('[dry-run]   → insufficient data (need ≥5
runs)', file=sys.stderr)) to avoid misleading use of f-strings.

In `@scripts/flakiness/flakiness_config.yml`:
- Around line 22-23: The YAML contains overly broad infrastructure patterns
"dependency" and "upstream" that will match many non-infra failures; replace
them with narrower, explicit patterns (e.g., more specific labels or prefixed
tokens like "infra:dependency" / "infra:upstream" or anchored regexes such as
^dependency-service- or ^upstream-ci- ) so only true infra-related test failures
are classified as noise; update the entries referencing "dependency" and
"upstream" in flakiness_config.yml (and any consumers that emit those tags) to
the chosen, more specific tokens.

In `@scripts/flakiness/report_flakiness.py`:
- Line 261: The unpacking owner, repo_name = args.repo.split('/', 1) in
report_flakiness.py will crash if args.repo lacks a '/'—add validation before
unpacking: check that args.repo contains a single '/' (or that split('/',1)
returns two parts), and if not call parser.error or raise a clear ValueError
with a message like "Invalid --repo format, expected owner/repo"; update the
parsing flow where args.repo is consumed (same spot as owner, repo_name) to
perform this check and fail fast with a helpful message.

In `@scripts/flakiness/retry_failures.py`:
- Around line 182-183: Validate and handle a malformed args.repo before
unpacking: check that args.repo contains a '/' (e.g., using 'in' or split
length) and, if not, print a clear error or raise a parser error/exit with
non-zero status so the script doesn't crash on unpacking; update the code around
the owner, repo_name = args.repo.split('/', 1) assignment to perform this
validation or wrap it in try/except and emit a helpful message (reference:
args.repo, owner, repo_name, and get_d1_credentials).
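Both report_flakiness.py and retry_failures.py need the same fail-fast check, so it could live in one place. A sketch using an argparse type hook; `parse_repo` is a hypothetical helper name, not from the PR:

```python
import argparse

def parse_repo(value):
    """argparse type hook: accept only 'owner/repo', fail fast otherwise."""
    parts = value.split('/', 1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise argparse.ArgumentTypeError(
            f'Invalid --repo format {value!r}, expected owner/repo')
    return value

parser = argparse.ArgumentParser()
parser.add_argument('--repo', type=parse_repo, required=True)
args = parser.parse_args(['--repo', 'OWASP-BLT/BLT-Leaf'])
owner, repo_name = args.repo.split('/', 1)  # now safe to unpack
```

argparse reports an ArgumentTypeError as a normal usage error with a non-zero exit status, so the script never reaches the unpacking with a malformed value.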

In `@src/cache.py`:
- Around line 337-347: The code currently sets _flakiness_cache['data'] = {} on
any exception from get_db/get_all_flakiness_scores which caches a failed D1
read; change it so the cache is only updated on a successful load: move the
assignments to _flakiness_cache['data'] and _flakiness_cache['timestamp'] inside
the try block after scores is retrieved, and in the except block do not write an
empty dict to the cache (just log the error). If you must handle the bootstrap
"table missing" case specially, detect that specific error from
get_all_flakiness_scores and decide whether to cache an explicit sentinel, but
do not globally memoize empty {} on any D1 failure; use the existing symbols
get_db, get_all_flakiness_scores, _flakiness_cache, env, and current_time to
locate and apply the change.

In `@src/database.py`:
- Around line 397-429: The flakiness helpers currently key and query only by
(check_name, job_name) causing collisions; update get_all_flakiness_scores and
get_flakiness_score to include repo (e.g., repository or repo_name) and
workflow_name in the SELECT, the dict key, and the WHERE clause (so keys become
(repo, workflow_name, check_name, job_name) and the prepared statement in
get_flakiness_score binds repo and workflow_name before check_name/job_name);
also update any callers consuming the dict to expect the extended key and modify
the DB schema/index (flakiness_scores table and its indexes) to store and index
repo and workflow_name for correct scoped lookups.
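The collision this comment describes disappears once the dict key carries the full identity. A sketch with hypothetical row data:

```python
def scores_by_key(rows):
    """Index flakiness rows by (repo, workflow_name, check_name, job_name)."""
    return {
        (r['repo'], r['workflow_name'], r['check_name'], r['job_name']):
            r['flakiness_score']
        for r in rows
    }

# Same check_name/job_name in two workflows: a (check_name, job_name) key
# would collide and keep only one score; the four-part key keeps both.
rows = [
    {'repo': 'OWASP-BLT/BLT-Leaf', 'workflow_name': 'ci',
     'check_name': 'tests', 'job_name': 'tests (3.11)', 'flakiness_score': 0.4},
    {'repo': 'OWASP-BLT/BLT-Leaf', 'workflow_name': 'nightly',
     'check_name': 'tests', 'job_name': 'tests (3.11)', 'flakiness_score': 0.1},
]
scores = scores_by_key(rows)
```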

---

Nitpick comments:
In `@scripts/flakiness/collect_ci_results.py`:
- Around line 98-99: The code currently treats any unknown conclusion by
silently returning 'pass' (the line with return 'pass'); update this to log the
unexpected conclusion value before returning so it's visible in CI logs — e.g.,
use the module's logger (logger.warning or logging.warn) or print a warning that
includes the conclusion variable (e.g., f"Unexpected conclusion: {conclusion} ->
treating as 'pass'") and then return 'pass'; place the log immediately before
the existing return so behavior is unchanged but observable.

In `@scripts/flakiness/db_utils.py`:
- Around line 83-89: get_infra_patterns currently assumes every row has a
'pattern' key and will KeyError if not; update the return to defensively extract
the value (use row.get('pattern')), filter out None/non-string values, and only
call .lower() on legitimate strings so missing or malformed rows are skipped
(modify the list comprehension in get_infra_patterns accordingly).
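A sketch of the defensive extraction; rows here are plain dicts, and the real db_utils.py row shape may differ:

```python
def get_infra_patterns(rows):
    """Lowercase 'pattern' values, skipping rows where it is missing or non-string."""
    patterns = []
    for row in rows:
        value = row.get('pattern')
        if isinstance(value, str):  # filters out None and non-string values
            patterns.append(value.lower())
    return patterns
```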

In `@scripts/flakiness/tests/test_report_flakiness.py`:
- Around line 77-79: Rename the test method test_contains_nextg_steps_section to
test_contains_next_steps_section to fix the typo; update the test function
definition where it calls _build_issue_body(_entry()) and asserts 'Next steps'
is in body so the test name accurately reflects its intent and improves
readability/discoverability.

In `@scripts/flakiness/tests/test_retry_failures.py`:
- Around line 289-317: The test_multiple_jobs_classified_independently test
patches mark_flake_confirmed as mock_mark but never asserts it was called;
update the test to assert mock_mark was called once for 'job-a' and not called
for 'job-b' after retry_failures.main() runs by using
mock_mark.assert_any_call(...) or equivalent and
mock_mark.assert_not_called()/assert_called_once_with for the appropriate
arguments matching how mark_flake_confirmed is invoked in the code; ensure you
reference the patched mock_mark and the job identifiers 'job-a' and 'job-b' when
adding the assertions.
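The missing assertions could look like this. `process_retries` is a toy stand-in for the classification path in retry_failures.main(), since only the mock-assertion shape matters here:

```python
from unittest import mock

def process_retries(jobs, mark_flake_confirmed):
    """Toy stand-in: confirm a flake only when the job's retry passed."""
    for job in jobs:
        if job['retry_passed']:
            mark_flake_confirmed(job['name'])

mock_mark = mock.Mock()
process_retries(
    [{'name': 'job-a', 'retry_passed': True},    # should be confirmed
     {'name': 'job-b', 'retry_passed': False}],  # should NOT be confirmed
    mock_mark,
)
# The assertions the review asks for: called exactly once, for job-a only.
mock_mark.assert_called_once_with('job-a')
mock_mark.assert_any_call('job-a')
```

`assert_called_once_with('job-a')` covers both conditions at once: exactly one call total (so 'job-b' was never passed) and that call's argument was 'job-a'.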
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: OWASP-BLT/coderabbit/.coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 16b17373-fb92-4ce5-baa7-28db482584ad

📥 Commits

Reviewing files that changed from the base of the PR and between a22a666 and 93a0dd1.

📒 Files selected for processing (20)
  • .github/workflows/flakiness_detector.yml
  • .gitignore
  • migrations/0004_create_flakiness_tables.sql
  • package.json
  • scripts/flakiness/analyze_flakiness.py
  • scripts/flakiness/collect_ci_results.py
  • scripts/flakiness/db_utils.py
  • scripts/flakiness/flakiness_config.yml
  • scripts/flakiness/report_flakiness.py
  • scripts/flakiness/retry_failures.py
  • scripts/flakiness/tests/__init__.py
  • scripts/flakiness/tests/test_analyze_flakiness.py
  • scripts/flakiness/tests/test_collect_ci_results.py
  • scripts/flakiness/tests/test_db_utils.py
  • scripts/flakiness/tests/test_report_flakiness.py
  • scripts/flakiness/tests/test_retry_failures.py
  • src/cache.py
  • src/database.py
  • src/handlers.py
  • src/utils.py

        result = analyze_check(history, config)
        if result is None:
            print(f'[dry-run]   → insufficient data (need ≥5 runs)', file=sys.stderr)

⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefix.

Line 172 has an f-string without any placeholders.

🔧 Fix
-                print(f'[dry-run]   → insufficient data (need ≥5 runs)', file=sys.stderr)
+                print('[dry-run]   → insufficient data (need ≥5 runs)', file=sys.stderr)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-                print(f'[dry-run]   → insufficient data (need ≥5 runs)', file=sys.stderr)
+                print('[dry-run]   → insufficient data (need ≥5 runs)', file=sys.stderr)
🧰 Tools
🪛 Ruff (0.15.5)

[error] 172-172: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/analyze_flakiness.py` at line 172, The print call uses an
unnecessary f-string prefix in the message printed to sys.stderr (the line
containing print(f'[dry-run]   → insufficient data (need ≥5 runs)',
file=sys.stderr)); remove the leading "f" so the literal string is printed
(change to print('[dry-run]   → insufficient data (need ≥5 runs)',
file=sys.stderr)) to avoid misleading use of f-strings.

@ojaswa072
Contributor

@Rudra-rps Solid concept and great test coverage. A few things to check:

  • Is flaky_count ever used in classification? Unconfirmed flakes could still reduce CI penalties
  • Is workflow_name intentionally dropped in history lookup?
  • Should issue identity include job_name too?
  • Migration missing dependency and upstream patterns from config - intentional?

@github-actions github-actions bot added files-changed: 21 PR changes 21 files and removed files-changed: 20 PR changes 20 files labels Mar 13, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
scripts/flakiness/analyze_flakiness.py (1)

172-172: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefix.

Line 172 has an f-string without any placeholders. This was flagged in a previous review.

🔧 Proposed fix
-                print(f'[dry-run]   → insufficient data (need ≥5 runs)', file=sys.stderr)
+                print('[dry-run]   → insufficient data (need ≥5 runs)', file=sys.stderr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/analyze_flakiness.py` at line 172, The print call using an
f-string at print(f'[dry-run]   → insufficient data (need ≥5 runs)',
file=sys.stderr) has no placeholders—remove the unnecessary f prefix and make it
a normal string literal in the same print(...) call; update the expression in
the analyze_flakiness.py print statement accordingly so it reads
print('[dry-run]   → insufficient data (need ≥5 runs)', file=sys.stderr) without
changing behavior.
🧹 Nitpick comments (5)
scripts/flakiness/report_flakiness.py (1)

278-279: Verify: Credentials variables may be unbound on certain code paths.

The variables account_id, db_id, token are only assigned when not args.dry_run (Line 278-279), but they're used in the d1_select call on Lines 363-367, which is also guarded by else (not dry_run). The logic appears correct, but the variable scope could be clearer.

Consider initializing to None before the conditional or restructuring to make the scope explicit:

💡 Optional clarity improvement
+    account_id = db_id = token = None
     if not args.dry_run:
         account_id, db_id, token = get_d1_credentials()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/flakiness/report_flakiness.py` around lines 278 - 279, The variables
account_id, db_id, and token are only set inside the if not args.dry_run branch
which can make their scope unclear; to fix, explicitly initialize account_id,
db_id, token = None, None, None before the if or move the d1-related logic
(including the d1_select call) into the same not args.dry_run block so
get_d1_credentials() and d1_select are in the same scope; update references to
args.dry_run, get_d1_credentials, and d1_select accordingly to ensure the
variables are always defined when used.
src/database.py (1)

414-415: Minor: Use f-string conversion flag for cleaner exception formatting.

Per static analysis, prefer {e!s} over {str(e)} for slightly cleaner syntax.

🔧 Proposed fix
     except Exception as e:
-        print(f"Flakiness: Error loading all scores: {str(e)}")
+        print(f"Flakiness: Error loading all scores: {e!s}")
         return {}
     except Exception as e:
         print(
             "Flakiness: Error loading score for "
-            f"{repo}/{workflow_name}/{check_name}/{job_name}: {str(e)}"
+            f"{repo}/{workflow_name}/{check_name}/{job_name}: {e!s}"
         )

Also applies to: 431-434

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database.py` around lines 414 - 415, Replace uses of str(e) in exception
formatting with the f-string conversion flag {e!s} for cleaner syntax;
specifically update the print/error messages that currently use
print(f"Flakiness: Error loading all scores: {str(e)}") and the similar block
around lines 431-434 to use print(f"Flakiness: Error loading all scores: {e!s}")
(or the equivalent logging call) so all exception interpolations use {e!s}
instead of str(e).
migrations/0005_extend_flakiness_scores.sql (1)

41-44: Note: idx_flakiness_scores_lookup may duplicate the PRIMARY KEY index.

SQLite automatically creates an index for composite PRIMARY KEYs. The explicit idx_flakiness_scores_lookup on the same columns is likely redundant but harmless. Consider removing it to reduce index maintenance overhead, or keep it if you need a named index for explicit query hints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@migrations/0005_extend_flakiness_scores.sql` around lines 41 - 44, The
explicit CREATE INDEX for idx_flakiness_scores_lookup duplicates the composite
PRIMARY KEY index on flakiness_scores(repo, workflow_name, check_name,
job_name); remove the CREATE INDEX IF NOT EXISTS idx_flakiness_scores_lookup ...
statement from the migration to avoid redundant index maintenance (or if you
intentionally need a named index for query hints, add a code comment explaining
that decision instead of leaving the redundant index in place).
src/utils.py (1)

479-492: Consider: Aggregate count approach may be overly lenient.

The current logic counts all known-flaky checks for the repo and applies reduced penalty if checks_failed <= known_flaky_count. This could reduce penalties even when the actual failing checks aren't the known-flaky ones.

Since pr_data only carries aggregate counts (not individual check names), this is a pragmatic compromise. The comment at Lines 480-481 acknowledges this as a "conservative upper bound." If more precise matching becomes important later, the readiness endpoint could be enhanced to pass individual check identities.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils.py` around lines 479 - 492, The current aggregate matching can
over-credit failures as "known flaky" because known_flaky_count may exceed the
repo's failing checks; to avoid reducing penalties incorrectly, clamp
known_flaky_count to the reported failing checks by using
pr_data.get('checks_failed') when computing the effective flaky count (i.e.,
effective_flaky = min(known_flaky_count, pr_data.get('checks_failed', 0)));
update the computation around known_flaky_count and add a short comment
referencing flakiness_scores, pr_data, and checks_failed that explains this
pragmatic clamp and TODO to switch to per-check identities when the readiness
endpoint can provide them.
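Under the aggregate-count constraint, the clamp-plus-comparison can be sketched as follows. The function name is hypothetical and the penalty values (0.50/0.20) are taken from the PR description's before/after example:

```python
def ci_penalty(checks_failed, known_flaky_count,
               base_penalty=0.50, reduced_penalty=0.20):
    """Reduce the CI penalty only when every failure could be a known flake.

    Clamp first: known flakes can never excuse more failures than occurred.
    Hypothetical sketch of the src/utils.py logic, not the actual code.
    """
    effective_flaky = min(known_flaky_count, checks_failed)
    if checks_failed > 0 and checks_failed <= effective_flaky:
        return reduced_penalty
    return base_penalty
```

With two failing checks and three known-flaky checks the penalty drops to 0.20; with three failures but only one known flake, the full 0.50 applies.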
src/cache.py (1)

341-342: Minor: Use f-string conversion flag.

🔧 Proposed fix
     except Exception as e:
-        print(f"Flakiness Cache: D1 load failed ({str(e)}), not caching")
+        print(f"Flakiness Cache: D1 load failed ({e!s}), not caching")
         return {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cache.py` around lines 341 - 342, In the except block that prints the D1
load failure message in src/cache.py (the handler that currently does: except
Exception as e: print(f"Flakiness Cache: D1 load failed ({str(e)}), not
caching")), replace the str(e) call with an f-string conversion flag so the
exception is shown with its representation (use {e!r}) to preserve type and
detail; keep the same print call and message context but use {e!r} instead of
str(e).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@migrations/0005_extend_flakiness_scores.sql`:
- Around line 21-36: The INSERT currently INNER JOINs flakiness_scores to
ci_run_history which will drop any flakiness_scores rows with no matching
ci_run_history; change the JOIN to a LEFT JOIN and use a fallback for repo
(e.g., COALESCE(h.repo, s.repo)) so orphaned scores are preserved when inserting
into flakiness_scores_v2, updating the SELECT to pull repo from h when present
otherwise from s; alternatively, if dropping orphans is intentional add a
comment above the query explaining that behavior.

---

Duplicate comments:
In `@scripts/flakiness/analyze_flakiness.py`:
- Line 172: The print call using an f-string at print(f'[dry-run]   →
insufficient data (need ≥5 runs)', file=sys.stderr) has no placeholders—remove
the unnecessary f prefix and make it a normal string literal in the same
print(...) call; update the expression in the analyze_flakiness.py print
statement accordingly so it reads print('[dry-run]   → insufficient data (need
≥5 runs)', file=sys.stderr) without changing behavior.

---

Nitpick comments:
In `@migrations/0005_extend_flakiness_scores.sql`:
- Around line 41-44: The explicit CREATE INDEX for idx_flakiness_scores_lookup
duplicates the composite PRIMARY KEY index on flakiness_scores(repo,
workflow_name, check_name, job_name); remove the CREATE INDEX IF NOT EXISTS
idx_flakiness_scores_lookup ... statement from the migration to avoid redundant
index maintenance (or if you intentionally need a named index for query hints,
add a code comment explaining that decision instead of leaving the redundant
index in place).

In `@scripts/flakiness/report_flakiness.py`:
- Around line 278-279: The variables account_id, db_id, and token are only set
inside the if not args.dry_run branch which can make their scope unclear; to
fix, explicitly initialize account_id, db_id, token = None, None, None before
the if or move the d1-related logic (including the d1_select call) into the same
not args.dry_run block so get_d1_credentials() and d1_select are in the same
scope; update references to args.dry_run, get_d1_credentials, and d1_select
accordingly to ensure the variables are always defined when used.

In `@src/cache.py`:
- Around line 341-342: In the except block that prints the D1 load failure
message in src/cache.py (the handler that currently does: except Exception as e:
print(f"Flakiness Cache: D1 load failed ({str(e)}), not caching")), replace the
str(e) call with an f-string conversion flag so the exception is shown with its
representation (use {e!r}) to preserve type and detail; keep the same print call
and message context but use {e!r} instead of str(e).

In `@src/database.py`:
- Around line 414-415: Replace uses of str(e) in exception formatting with the
f-string conversion flag {e!s} for cleaner syntax; specifically update the
print/error messages that currently use print(f"Flakiness: Error loading all
scores: {str(e)}") and the similar block around lines 431-434 to use
print(f"Flakiness: Error loading all scores: {e!s}") (or the equivalent logging
call) so all exception interpolations use {e!s} instead of str(e).

In `@src/utils.py`:
- Around line 479-492: The current aggregate matching can over-credit failures
as "known flaky" because known_flaky_count may exceed the repo's failing checks;
to avoid reducing penalties incorrectly, clamp known_flaky_count to the reported
failing checks by using pr_data.get('checks_failed') when computing the
effective flaky count (i.e., effective_flaky = min(known_flaky_count,
pr_data.get('checks_failed', 0))); update the computation around
known_flaky_count and add a short comment referencing flakiness_scores, pr_data,
and checks_failed that explains this pragmatic clamp and TODO to switch to
per-check identities when the readiness endpoint can provide them.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: OWASP-BLT/coderabbit/.coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 034ff890-a354-4901-af94-bbf77a7571b0

📥 Commits

Reviewing files that changed from the base of the PR and between 93a0dd1 and 1c8ca7c.

📒 Files selected for processing (9)
  • migrations/0004_create_flakiness_tables.sql
  • migrations/0005_extend_flakiness_scores.sql
  • scripts/flakiness/analyze_flakiness.py
  • scripts/flakiness/flakiness_config.yml
  • scripts/flakiness/report_flakiness.py
  • scripts/flakiness/retry_failures.py
  • src/cache.py
  • src/database.py
  • src/utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/flakiness/flakiness_config.yml
  • migrations/0004_create_flakiness_tables.sql

@Rudra-rps
Contributor Author

@Rudra-rps Solid concept and great test coverage. A few things to check:

  • Is flaky_count ever used in classification? Unconfirmed flakes could still reduce CI penalties
  • Is workflow_name intentionally dropped in history lookup?
  • Should issue identity include job_name too?
  • Migration missing dependency and upstream patterns from config - intentional?

The system classifies a check as flaky only when the failure rate crosses the configured threshold and at least one intermittent flaky occurrence is detected.

Also, the system doesn't drop workflow_name; the distinct job combination includes workflow_name, job_name, and check_name.
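The two-condition classification rule described above can be sketched as follows. The names and the 0.05 threshold are illustrative, not the actual analyze_flakiness.py code:

```python
def is_flaky(history, confirmed_flakes, failure_rate_threshold=0.05):
    """Flaky iff the failure rate crosses the threshold AND at least one
    intermittent flake was confirmed (e.g., a failure that passed on retry).
    Hypothetical sketch of the rule described in the comment above."""
    if len(history) < 5:  # mirrors the ≥5-runs minimum from the dry-run output
        return False
    failure_rate = history.count('fail') / len(history)
    return failure_rate >= failure_rate_threshold and confirmed_flakes >= 1

# A 20-run window with 2 intermittent failures, one confirmed by retry:
window = ['pass'] * 18 + ['fail'] * 2
```

Requiring at least one confirmed flake is what keeps unconfirmed failures from reducing CI penalties on their own.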

@github-actions github-actions bot added files-changed: 19 PR changes 19 files and removed files-changed: 21 PR changes 21 files labels Mar 15, 2026
@Rudra-rps
Contributor Author

I made a few optimizations to simplify the flakiness schema and lookup flow.

• Schema cleanup: flakiness_scores now uses (repo, workflow_name, check_name, job_name) as the primary key in 0004_create_flakiness_tables.sql, with repo as the leading column. The indexes previously introduced in 0005 were also moved into this migration.

• Removed redundant migration: Since the schema now includes repo from the start, 0005_extend_flakiness_scores.sql (which previously added the column via a create–copy–rename process) is no longer needed and has been removed.

• Simplified flakiness lookup: The in-memory 60-minute TTL cache layer was removed. handlers.py now directly calls get_all_flakiness_scores() instead of using get_cached_flakiness_scores() from cache.py. The behavior remains the same because get_all_flakiness_scores() already returns {} safely on failure.

Overall this reduces migration complexity and removes an unnecessary caching layer while keeping the runtime behavior unchanged.

@owasp-blt

owasp-blt bot commented Mar 15, 2026

🎓 A mentor has been assigned to this issue!

Mentor: @dev-sanidhya
Contributor: @ojaswa072

@dev-sanidhya — please provide guidance and support. Use /handoff if you need to transfer mentorship.

@ojaswa072@dev-sanidhya will help you through this. Feel free to ask questions here. Use /rematch if you need a different mentor.

Happy coding! 🚀 — OWASP BLT-Pool


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/handlers.py (1)

1587-1588: Reuse one DB handle inside _run_readiness_analysis.

This works, but opening another DB handle at Line 1588 is avoidable. Reusing one local db instance keeps the function tighter and reduces per-call overhead.

♻️ Proposed refactor
 async def _run_readiness_analysis(env, pr, pr_id, github_token):
@@
-        # Calculate and update review_status from timeline data
+        db = get_db(env)
+
+        # Calculate and update review_status from timeline data
         # This ensures the database has the latest review status without making duplicate API calls
         original_review_status = pr.get('review_status', 'pending')
         review_status = calculate_review_status(timeline_data.get('reviews', []))
         if review_status != original_review_status:
             # Update review_status in database only if it actually changed
-            db = get_db(env)
             await db.prepare(
                 'UPDATE prs SET review_status = ?, updated_at = CURRENT_TIMESTAMP WHERE id = ?'
             ).bind(review_status, pr_id).run()
             pr['review_status'] = review_status
@@
-        flakiness_scores = await get_all_flakiness_scores(get_db(env))
+        flakiness_scores = await get_all_flakiness_scores(db)

Also applies to: 1591-1594

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/handlers.py` around lines 1587 - 1588, In _run_readiness_analysis, avoid
opening multiple DB handles by calling get_db(env) once into a local variable
(e.g., db) and reuse that handle for subsequent calls such as
get_all_flakiness_scores(db), get_latest_member_metrics(db), and
get_member_merged_prs_by_repo(db) (and any other DB-using helpers in the same
function), replacing the extra get_db(env) invocations so the function reuses a
single DB instance for all operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/handlers.py`:
- Around line 1587-1588: In _run_readiness_analysis, avoid opening multiple DB
handles by calling get_db(env) once into a local variable (e.g., db) and reuse
that handle for subsequent calls such as get_all_flakiness_scores(db),
get_latest_member_metrics(db), and get_member_merged_prs_by_repo(db) (and any
other DB-using helpers in the same function), replacing the extra get_db(env)
invocations so the function reuses a single DB instance for all operations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: OWASP-BLT/coderabbit/.coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 90ce6247-fe73-4264-94cb-7b5c3ac17ef8

📥 Commits

Reviewing files that changed from the base of the PR and between 1c8ca7c and 99df40a.

📒 Files selected for processing (2)
  • migrations/0004_create_flakiness_tables.sql
  • src/handlers.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • migrations/0004_create_flakiness_tables.sql
