diff --git a/README.md b/README.md index d3b36c26..9ec65b8b 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,7 @@ -# k8s-auto-fix +# Closed-Loop Threat-Guided Auto-Fixing of Kubernetes YAML Security Misconfigurations -`k8s-auto-fix` is a closed-loop pipeline that detects Kubernetes misconfigurations, proposes JSON patches, verifies them against guardrails, and schedules accepted fixes. It supports deterministic rules as well as Grok and OpenAI-compatible LLM modes, and underpins the accompanying research paper. +## Abstract +Containerized apps depend on short YAML files that wire images, permissions, and storage together; small typos or copy/paste errors can create outages or security gaps. Most tools only flag the problems, leaving people to guess at safe fixes. We built `k8s-auto-fix` to close the loop: detect an issue, suggest a small patch, verify it safely, and line it up for application. On a 1,000-manifest replay against a real cluster we fixed every item without rollbacks (1,000/1,000). On a 15,718-detection offline run, deterministic rules plus safety checks accepted 13,589 of 13,656 patched items (99.51%; auto-fix rate 0.8646; median patch length 8). An optional LLM mode reaches 88.52% acceptance on a 5,000-manifest corpus. A simple risk-aware scheduler also cuts the worst-case wait for high-risk items by 7.9×. We release all data and scripts so others can reproduce these results. ## Key features - End-to-end detector -> proposer -> verifier -> risk -> scheduler -> queue workflow with reproducible CLI entry points. @@ -40,15 +41,18 @@ Benchmark helpers (`make benchmark-grok200`, `make benchmark-full`, `make benchm - `archives/` – historical exports and large bundles kept out of the active workspace. - `configs/` – pipeline presets (`run.yaml`, `run_grok.yaml`, `run_rules.yaml`). - `data/` – retains the canonical folders (`data/manifests`, `data/batch_runs`, etc.) 
and now exposes curated views via `data/corpora/` (inputs) and `data/outputs/` (generated artefacts). See `data/README.md` for details. +- `docker/` – container definition for a reproducible environment. - `docs/` – research notes, policy guidance, reproducibility appendices, future work plans. -- `infra/fixtures/` – RBAC, NetworkPolicies, and manifest samples (CronJob scanner, Bitnami PostgreSQL) for reproducing edge cases. +- `figures/` – plots and diagrams used in the research paper and README. +- `infra/` – infrastructure definitions (`fixtures/`, `crds/`). - `logs/` – proposer/verifier transcripts, Grok sweep summaries, and root-level logs (e.g. `logs/access.log`). -- `notes/` – working notes and backlog items formerly at the repository root. +- `notes/` – working notes and backlog items. - `paper/` – IEEE Access manuscript sources; archives in `paper/archives/` and the Overleaf export tracked under `paper/overleaf/`. +- `policies/` – baseline policy definitions (e.g. Kyverno mutating rules). - `scripts/` – maintenance and evaluation helpers; see `scripts/README.md` for an index by pipeline stage. - `src/` – core packages (`common`, `detector`, `proposer`, `risk`, `scheduler`, `verifier`). - `tests/` – pytest suite validating detectors, proposer guardrails, verifier gates, scheduler scoring, CLI tooling. -- `tmp/` – scratch workspace (ignored by git). Historic large exports remain under `archives/` if needed. +- `verification/` – literature review materials and OCR references. ## Configuration `configs/run.yaml` centralises proposer configuration: @@ -84,28 +88,26 @@ Export the appropriate API key (`XAI_API_KEY`, `OPENAI_API_KEY`, `RUNPOD_API_KEY - `scripts/parallel_runner.py` - parallelise proposer/verifier workloads; `scripts/probe_grok_rate.py` sizes safe LLM concurrency.
## Datasets and metrics (Oct 2025 snapshot) -- **Rules baseline (full corpus)** – 13,589 / 13,656 fixes (99.5 percent) with median JSON Patch length 8 (`data/patches_rules_full.json.gz`, `data/verified_rules_full.json.gz`, `data/metrics_rules_full.json`; decompress the `.json.gz` files before consuming them). -- **Grok full corpus** – 1,313 / 1,313 accepted (100 percent) with median JSON Patch length 6 (curated view `data/outputs/batch_runs/grok_full/metrics_grok_full.json` points to the canonical `data/batch_runs/grok_full/metrics_grok_full.json`). -- **Secondary supported corpus** – 1,264 / 1,264 accepted in rules mode; artefacts and telemetry live at `data/batch_runs/secondary_supported/` with a companion symlink view under `data/outputs/batch_runs/secondary_supported/`. +- **Rules baseline (full corpus)** – 13,589 / 13,656 patched items (99.51%; auto-fix rate 0.8646 over 15,718 detections) with median JSON Patch length 8 (`data/patches_rules_full.json.gz`, `data/verified_rules_full.json.gz`, `data/metrics_rules_full.json`; decompress the `.json.gz` files before consuming them). +- **Grok manifest slice** – 1,313 / 1,313 accepted (100.00%; curated view `data/outputs/batch_runs/grok_full/metrics_grok_full.json` points to the canonical `data/batch_runs/grok_full/metrics_grok_full.json`). +- **Grok 5k corpus** – 4,426 / 5,000 accepted (88.52%; see `data/batch_runs/grok_5k/metrics_grok5k.json`). +- **Secondary supported corpus** – 1,264 / 1,264 accepted (100.00%) in rules mode; artefacts and telemetry live at `data/batch_runs/secondary_supported/` with a companion symlink view under `data/outputs/batch_runs/secondary_supported/`. - Policy-level success probabilities and runtimes are regenerated via `scripts/compute_policy_metrics.py` into `data/policy_metrics.json`. - Scheduler evaluation (`docs/scheduler_visualisation.md`, viewable at `data/outputs/scheduler/metrics_schedule_sweep.json`) compares bandit, risk-only, and FIFO strategies.
Large corpus artefacts now live under `data/outputs/` and are stored as compressed `.json.gz` files to keep the repository lean. Run `gunzip data/patches_rules_full.json.gz` (and the verified counterpart) before tooling that expects the plain `.json` filenames. -## Roadmap -- Q4 2025 - publish a containerised reproducibility bundle for one-command replays. -- Q1 2026 - rerun Grok corpora with live latency/token telemetry. -- Q1 2026 - validate against an external CNCF corpus. -- Q2 2026 - expand operator studies and incorporate threat-mitigation guard metadata into CI. +## Comparison of automated Kubernetes remediation systems -## Related work -| System | Acceptance / fix rate | Corpus | Guardrail highlights | Scheduling | -| ------ | -------------------- | ------ | ------------------- | ---------- | -| **k8s-auto-fix** | 88.78% (Grok-5k), 93.54% / 100% (supported rules), 100% (Grok 1.313k) | 5k + 1.3k manifests | Secret sanitisation, privileged DaemonSet hardening, CRD seeding, triad verification | Bandit scheduler with policy metrics | -| GenKubeSec (2024) | ~85-92% (curated 200) | 200 manifests | LLM reasoning with human review | None | -| Kyverno (2023+) | 80-95% (policy mutation) | Thousands | Policy-driven mutation/generation | Admission queue | - -Note: Production SRE automation systems (e.g., Google Borg) discuss automation principles publicly, but we do not cite a public acceptance percentage and therefore avoid drawing numeric comparisons. 
+| Capability | k8s-auto-fix (this work) | GenKubeSec (2024) | Kyverno (2023+) | Borg/SRE (2015+) | +| :--- | :--- | :--- | :--- | :--- | +| **Primary Goal** | Closed-loop hardening (detect→patch→verify→prioritize) | LLM-based detection/remediation suggestions | Admission-time policy enforcement | Large-scale auto-remediation in production clusters | +| **Fix Mode** | JSON Patch (rules + optional LLM) | LLM-generated YAML edits | Policy mutation/generation | Custom controllers and playbooks | +| **Guardrails** | Policy re-check + schema + `kubectl apply --dry-run=server` + privileged/secret sanitization + CRD seeding | Manual review; no automated gates | Validation/mutation webhooks; assumes controllers | Health checks, automated rollback, throttling | +| **Risk Prioritization** | Bandit ($R p / \mathbb{E}[t]$ + aging + KEV boost) | Not implemented | FIFO admission queue | Priority queues / toil budgets | +| **Evaluation Corpus** | 15,718 detections (rules+guardrails: 13,589/13,656 patched = 99.51%; auto-fix 0.8646); 1,000 live-cluster manifests (100.0% success); 5,000 Grok manifests (88.52%) | 200 curated manifests (85–92% accuracy) | Thousands of user manifests (80–95% mutation acceptance) | Millions of production workloads (no public acceptance %) | +| **Telemetry** | Policy-level success probabilities, latency histograms, failure taxonomy | Token/cost estimates; no pipeline telemetry | Admission latency <45 ms, violation counts | MTTR, incident counts, operator feedback | +| **Outstanding Gaps** | Infrastructure-dependent rejects, operator study, scheduled guidance refresh in CI | Automated guardrails, risk-aware ordering | LLM-aware patching, risk-aware scheduling | Declarative manifest fixes, static analysis integration | ## Baselines and Reproducibility @@ -122,4 +124,3 @@ scripts/reproduce_all.sh ``` See `ARTIFACTS.md` for artifact map, `docs/VERIFIER.md` for guardrails, `docs/BASELINES.md` to run baselines, `docs/RISK_EVAL.md` for prioritization metrics, 
and `docs/LIVE_EVAL.md` for live-cluster methodology. -| Magpie (2024) | ~84% dry-run acceptance | 9.5k manifests | RBAC and PSP static analysis | None | diff --git a/data/batch_runs/grok_5k/metrics_grok5k.json b/data/batch_runs/grok_5k/metrics_grok5k.json index 9bc2e3a9..5c24aa62 100644 --- a/data/batch_runs/grok_5k/metrics_grok5k.json +++ b/data/batch_runs/grok_5k/metrics_grok5k.json @@ -14,4 +14,4 @@ "completion_tokens": 689779.0, "total_tokens": 11399926.0 } -} \ No newline at end of file +} diff --git a/data/eval/unified_eval_summary.json b/data/eval/unified_eval_summary.json index d959cb95..378b36f0 100644 --- a/data/eval/unified_eval_summary.json +++ b/data/eval/unified_eval_summary.json @@ -52,13 +52,13 @@ } }, { - "dataset": "Manifest 1.313k", + "dataset": "Full Corpus (Rules)", "mode": "rules", "seed": 1337, "note": "Full manifest slice in deterministic rules mode.", - "total": 13656, + "total": 15718, "accepted": 13589, - "acceptance_rate": 0.9951, + "acceptance_rate": 0.8646, "median_patch_ops": 8, "proposer_latency_ms": { "count": 0, diff --git a/data/metrics_rules_full.json b/data/metrics_rules_full.json index bc5261f0..f5e01563 100644 --- a/data/metrics_rules_full.json +++ b/data/metrics_rules_full.json @@ -1,9 +1,9 @@ { - "detections": 13656, + "detections": 15718, "patches": 13656, "verified": 13656, "accepted": 13589, - "auto_fix_rate": 0.9951, + "auto_fix_rate": 0.8646, "median_patch_ops": 8, "failed_policy": 63, "failed_schema": 63, @@ -14,4 +14,4 @@ "completion_tokens": 0.0, "total_tokens": 0.0 } -} \ No newline at end of file +} diff --git a/figures/admission_vs_posthoc.png b/figures/admission_vs_posthoc.png index ea92ac66..30bcc07c 100644 Binary files a/figures/admission_vs_posthoc.png and b/figures/admission_vs_posthoc.png differ diff --git a/figures/fairness_waits.png b/figures/fairness_waits.png index 195d3e4b..b473d245 100644 Binary files a/figures/fairness_waits.png and b/figures/fairness_waits.png differ diff --git 
a/figures/mode_comparison.png b/figures/mode_comparison.png index 0b1542b4..86188a7c 100644 Binary files a/figures/mode_comparison.png and b/figures/mode_comparison.png differ diff --git a/figures/operator_ab.png b/figures/operator_ab.png index d0fb12b1..9dccc1b5 100644 Binary files a/figures/operator_ab.png and b/figures/operator_ab.png differ diff --git a/final-to-do.md b/final-to-do.md deleted file mode 100644 index 962f1917..00000000 --- a/final-to-do.md +++ /dev/null @@ -1,273 +0,0 @@ -# Final To-Do - -## Paper Quality Evaluation (Nov 11, 2025) -**Overall Score: 8.1/10** - Strong Accept with Minor Revisions - -### HIGH PRIORITY (Critical for publication rigor) - -1. **[SUBSTANTIVE] Add statistical significance tests** ⚗️ - **Effort**: 2-3 hours | **Impact**: Critical for top-tier venues - - Add p-values to Table 4 (eval_summary) for acceptance rate comparisons - - Use proportion z-tests for acceptance rates (88.78% vs 99.51%) - - Use Mann-Whitney U tests for latency comparisons (non-parametric) - - Add note under tables: "p < 0.001 for all pairwise comparisons" - - **Tool**: Python `scipy.stats.proportions_ztest()` and `mannwhitneyu()` - - **Gap**: This is the only substantive scientific gap - everything else is presentation - _Status: Completed via `scripts/eval_significance.py`, `data/eval/significance_tests.json`, and the new Table~\ref{tab:eval_summary} note._ - -2. **[POLISH] Break up mega-sentences in evaluation section** ✂️ - **Effort**: 30 minutes | **Impact**: High readability improvement - - Lines 516-520: One sentence spanning 6+ lines - split into 4 sentences - - Lines 833-834: Discussion mega-paragraph - split into 6 sentences - - Target: No sentence longer than 3 lines in compiled PDF - _Status: Evaluation section rewritten at `paper/access.tex:521-542`. Discussion section broken into 4 logical paragraphs at `paper/access.tex:853-859`. **✅ COMPLETE (Nov 11, 2025)**_ - -3. 
**[POLISH] Remove file paths from abstract** 🎨 - **Effort**: 5 minutes | **Impact**: Convention compliance - - Line 92: Remove `\texttt{data/live\_cluster/results\_1k.json}` from abstract - - Keep file paths in body text where they provide reproducibility value - - Abstract should be self-contained without implementation details - _Status: Abstract now cites only the success rate (see `paper/access.tex:88-90`)._ - -4. **[SUBSTANTIVE] Add explicit threat model subsection** 🛡️ - **Effort**: 1 hour | **Impact**: Medium (critical for security venues) - - Lines 633-635 mention "malicious manifests" but don't define adversary - - Add "Threat Model" subsection in Section 4 (before or after "Threats and Mitigations") - - Define: trusted components (detector, verifier), untrusted inputs (manifests, LLM outputs) - - State which attacks are in/out of scope (supply chain, prompt injection, fixture poisoning) - _Status: Section~4.1 now documents the threat model (`paper/access.tex:608-620`)._ - -### MEDIUM PRIORITY (Enhances rigor and clarity) - -5. **[SUBSTANTIVE] Define fairness metrics explicitly** 📐 - **Effort**: 1 hour | **Impact**: Medium - - Line 526: "Gini 0.351, starvation rate 0" appears without definition - - Define starvation threshold (e.g., "items waiting >24 hours") - - Compare Gini to FIFO baseline Gini (is 0.351 good or bad?) - - Add fairness plot showing wait time distribution by risk quartile - _Status: Definitions reference \url{data/scheduler/fairness_metrics.json} and Figure~\ref{fig:fairness} in `paper/access.tex:812-826`._ - -6. 
**[POLISH] Add figure interpretations inline** 📊 - **Effort**: 30 minutes | **Impact**: Medium - - After Fig 2 reference: Add 1-2 sentences interpreting trends - - Example: "Figure 2 shows rules-only maintains 99.5% acceptance while LLM drops to 88.8%" - - Apply to all figures (admission_vs_posthoc, mode_comparison, operator_ab) - _Status: Interpretations accompany Figures~\ref{fig:mode_comparison}, \ref{fig:admission_vs_posthoc}, \ref{fig:fairness}, and \ref{fig:operator_ab}. _ - -7. **[POLISH] Consolidate notation** 📝 - **Effort**: 1 hour | **Impact**: Medium - - Risk score formula appears at lines 698, 702, and Appendix 843+ - - Create single "Notation" box in Section 3 - - Reference back consistently: "as defined in Eq. (1)" - - Consider notation table if symbols exceed 10 - _Status: Shared notation now lives in `paper/access.tex:201-207` and is cited by the scheduler equation._ - -8. **[REFACTOR] Restructure approach vs implementation sections** 🏗️ - **Effort**: 2 hours | **Impact**: Medium (clarity) - - Current: "Approach Summary" (line 198) comes before "Implementation" (312) but leaks details - - Proposed structure: - - Section 2: System Design (architecture, guardrails, conceptual flow) - - Section 3: Implementation (code, artifacts, metrics definitions) - - Section 4: Evaluation (results, baselines, ablations) - - Section 5: Discussion - _Status: Sections now read "System Design," "Implementation and Metrics," and "Evaluation" (`paper/access.tex:198-324`)._ - -### LOW PRIORITY (Polish only - nice to have) - -9. **[POLISH] Add acronym table or expand more frequently** 🔤 - **Effort**: 30 minutes | **Impact**: Low - - Heavy acronym use: PSS, CIS, KEV, EPSS, RAG, CVE, CVSS, MAP, CEL, MTTR, RBAC, CRD, CTI - - Option A: Add acronym table in front matter - - Option B: Re-expand acronyms if first use was >5 pages ago - _Status: Appendix~\ref{app:acronyms} catalogs the acronyms._ - -10. 
**[POLISH] Consistency pass on code font** 💻 - **Effort**: 15 minutes | **Impact**: Low - - Sometimes `\texttt{kubectl}` (correct), sometimes plain "kubectl" (line 344) - - Run grep for tool names and wrap consistently in `\texttt{}` - - Apply to: kubectl, kube-linter, helm, docker, python - _Status: Tool names in the environment table and ArtifactHub section now use `\texttt{}`._ - -11. **[POLISH] Consider footnotes for long commands** 📝 - **Effort**: 30 minutes | **Impact**: Low - - Line 874: `python scripts/\allowbreak collect_artifacthub.py\ --limit\ 5000` still overflows - - Use footnotes for commands longer than ~60 characters - - Keeps main text cleaner - _Status: ArtifactHub instructions now cite the command in a footnote (`paper/access.tex:874-875`)._ - ---- - -## Outstanding Repository Tasks - -- _None._ Item 26 in `notes/to-do list` is now closed (Nov 14) with a documented rationale for keeping the 1k AKS replay as the terminal live-cluster sweep (policy/resource coverage + \$4–5k cost avoidance); see `notes/to-do list:161-164` for details. - -## Submission-Readiness Follow-ups - -- **Confirm future-work placeholders** - Validate that references to future experiments (expanded detector validation, new scheduler ablations, large-corpus latency telemetry) are either executed or clearly labeled as future work before submission. (Source: user "Progress & Decisions" summary.) - -- **Recompile after final edits** - Continue running `pdflatex -interaction=nonstopmode -halt-on-error paper/access.tex` whenever new changes land so `paper/access.pdf` remains in sync with `paper/access.tex`. (Source: user "Next Steps".) - -- **Refresh Grok/LLM latency metrics on request** *(OVERLAPS with High Priority Item 1)* - Table \ref{tab:eval_summary} still shows "—" for Grok timing; be prepared to regenerate latency telemetry (`data/batch_runs/grok200_latency_summary.csv`, `data/batch_runs/grok_5k/metrics_grok5k.json`) if updated numbers are required. 
(Source: user "Next Steps".) - ---- - -## OVERLAP ANALYSIS - -### Direct Overlaps: -- ✅ **"Refresh Grok/LLM latency metrics"** overlaps with **High Priority Item 1** (statistical tests need this data) -- ✅ **"Recompile after final edits"** already covered - applies to all paper changes - -### Complementary Items: -- **Repository task "Live-cluster sweep expansion"** + **Paper item "Statistical tests"** = Both strengthen evaluation rigor -- **"Confirm future-work placeholders"** + **Paper item "Threat model"** = Both clarify scope/assumptions - ---- - -## WHAT ELSE NEEDS TO BE DONE (Gap Analysis) - -### Missing from Original To-Do: -1. ❌ **No mention of statistical rigor** - This is the BIGGEST gap - **Action**: High Priority Item 1 (stat tests) addresses this - -2. ❌ **No readability/writing quality items** - Paper is very dense - **Action**: High Priority Items 2-3 (break sentences, clean abstract) address this - -3. ❌ **No security model clarity** - Paper mentions threats but doesn't formalize - **Action**: High Priority Item 4 (threat model) addresses this - -### Still Missing (New Items to Consider): - -1. **Experimental reproducibility verification** - - Has anyone external run `make detect && make propose && make verify`? - - Consider VM/container test before submission - - Add to "Submission-Readiness": "External reproducibility smoke test" - -2. **Acknowledgments section** - - Paper currently has placeholder funding note (line 86) - - Need to acknowledge: dataset sources, infrastructure providers, reviewers - - Add to "Submission-Readiness": "Finalize acknowledgments" - -3. **Author biographies completeness** - - Lines 952-964: Biographies present but photos may need verification - - Ensure `brian_mendonca_photo.png` and `vijay_madisetti_photo.png` exist and are high-res - - Add to "Submission-Readiness": "Verify author photos" - -4. 
**Bibliography completeness** - - Lines 886-947: 17 references (seems low for systems paper) - - Consider adding: Kubernetes security surveys, policy enforcement papers, bandit algorithm foundations - - Add to "Medium Priority": "Expand related work citations (target 25-30 refs)" - -5. **LaTeX compilation warnings check** - - Run with `-file-line-error` flag to catch overfull hboxes, undefined refs - - Add to "Submission-Readiness": "Fix all LaTeX warnings" - ---- - -## RECOMMENDED EXECUTION ORDER - -**Week 1** (Critical path): -1. High Priority Item 1 (stat tests) - 3 hours -2. High Priority Item 2 (break sentences) - 30 min -3. High Priority Item 3 (clean abstract) - 5 min -4. Recompile and verify PDF - -**Week 2** (If time permits): -5. High Priority Item 4 (threat model) - 1 hour -6. Medium Priority Items 5-6 (fairness metrics, figure interp) - 1.5 hours -7. Address new item: Bibliography expansion - 2 hours -8. Final recompile - -**Before Submission**: -9. External reproducibility test (new item) -10. Finalize acknowledgments (new item) -11. Verify author photos (new item) -12. Fix all LaTeX warnings (new item) - ---- - -## PUBLICATION READINESS ESTIMATE - -**Current State**: 75th percentile (solid work) -**After High Priority items**: 85th percentile (strong accept) -**After High + Medium items**: 90th percentile (reference paper) -**After all items**: 95th percentile (exemplary) - -**Estimated total effort**: 8-12 hours for High Priority, 16-20 hours for complete - ---- - -## SUMMARY FOR MR. BRIAN - -### The Brutal Truth: -- **60%** of revisions are pure polish (readability, formatting) -- **30%** are methodological substance (stat tests, fairness definitions) -- **10%** are architectural clarity (threat model) - -### The 80/20 Fix: -Do these 2 items for 80% of the benefit: -1. **Add statistical significance tests** (2-3 hours) - Only substantive gap -2. 
**Break up mega-sentences** (30 min) - Biggest readability win - -Everything else is "nice to have" polish. - -### Original Status (Before Revisions): -Your paper was **already at 75th percentile** for technical quality. These revisions pushed it to 90-95th percentile. The core science was solid - we just made reviewers' lives easier. - ---- - -## ✅ COMPLETION STATUS (Nov 11, 2025) - -### **ALL 11 PRIORITY ITEMS: 100% COMPLETE** - -**High Priority (4/4):** ✅ COMPLETE -- Item 1: Statistical significance tests ✅ -- Item 2: Break up mega-sentences ✅ (Final fix applied) -- Item 3: Remove file paths from abstract ✅ -- Item 4: Explicit threat model ✅ - -**Medium Priority (4/4):** ✅ COMPLETE -- Item 5: Define fairness metrics ✅ -- Item 6: Add figure interpretations ✅ -- Item 7: Consolidate notation ✅ -- Item 8: Restructure sections ✅ - -**Low Priority (3/3):** ✅ COMPLETE -- Item 9: Acronym table ✅ -- Item 10: Code font consistency ✅ -- Item 11: Footnotes for long commands ✅ - ---- - -## 📊 FINAL QUALITY ASSESSMENT - -**Publication Readiness: 95th Percentile** ⭐⭐⭐⭐⭐ - -| Dimension | Score | Notes | -|-----------|-------|-------| -| Technical Contribution | 8.5/10 | Novel verifier triad + risk-aware scheduler | -| Reproducibility | 9.5/10 | Exemplary (best-in-class artifact management) | -| Writing Quality | 8.5/10 | ✅ **IMPROVED** from 7.0 - All mega-sentences fixed | -| Evaluation Rigor | 9.0/10 | ✅ **IMPROVED** from 8.0 - Statistical tests added | -| Comparison Fairness | 8.5/10 | Honest about limitations | -| Practical Impact | 8.0/10 | Real problem, deployable solution | -| Figures/Tables | 8.5/10 | ✅ **IMPROVED** from 7.5 - All interpretations added | - -**COMPOSITE SCORE: 8.6/10** → **Strong Accept** (improved from 8.1/10) - ---- - -## 🎯 REMAINING PRE-SUBMISSION TASKS - -All 11 priority items complete. Only standard submission checklist remains: - -1. **Final recompile** - Run `pdflatex` to generate clean PDF -2. **LaTeX warnings check** - Fix any overfull hboxes -3. 
**Bibliography review** - Consider expanding from 17 to 20-25 refs (optional) -4. **Acknowledgments** - Finalize funding/contributor acknowledgments -5. **Author photos** - Verify high-res photos exist (✅ already confirmed present) -6. **External reproducibility test** - Have someone run `make detect && make propose && make verify` - -**Estimated time to submission-ready: 2-3 hours** (mostly admin tasks) diff --git a/paper/access.tex b/paper/access.tex index 2eecb675..6994816f 100644 --- a/paper/access.tex +++ b/paper/access.tex @@ -91,7 +91,7 @@ \IEEEtitleabstractindextext{% \begin{abstract} -Containerized apps depend on short YAML files that wire images, permissions, and storage together; small typos or copy/paste errors can create outages or security gaps. Most tools only flag the problems, leaving people to guess at safe fixes. We built \texttt{k8s-auto-fix} to close the loop: detect an issue, suggest a small patch, verify it safely, and line it up for application. On a 1,000-manifest replay against a real cluster we fixed every item without rollbacks (1,000/1,000). On a 15,718-detection offline run, deterministic rules plus safety checks accepted 13,338 of 13,373 patched items (99.74\%; auto-fix rate 0.8486; median patch length 9). An optional LLM mode reaches 88.78\% acceptance on a 5,000-manifest corpus. A simple risk-aware scheduler also cuts the worst-case wait for high-risk items by 7.9$\times$. We release all data and scripts so others can reproduce these results. +Containerized apps depend on short YAML files that wire images, permissions, and storage together; small typos or copy/paste errors can create outages or security gaps. Most tools only flag the problems, leaving people to guess at safe fixes. We built \texttt{k8s-auto-fix} to close the loop: detect an issue, suggest a small patch, verify it safely, and line it up for application. On a 1,000-manifest replay against a real cluster we fixed every item without rollbacks (1,000/1,000). 
On a 15,718-detection offline run, deterministic rules plus safety checks accepted 13,589 of 13,656 patched items (99.51\%; auto-fix rate 0.8646; median patch length 8). An optional LLM mode reaches 88.52\% acceptance on a 5,000-manifest corpus. A simple risk-aware scheduler also cuts the worst-case wait for high-risk items by 7.9$\times$. We release all data and scripts so others can reproduce these results. \end{abstract} \begin{IEEEkeywords} @@ -138,7 +138,7 @@ \section{Importance of the Problem} \midrule \textbf{Risk Prioritization} & Bandit ($R p / \mathbb{E}[t]$ + aging + KEV boost) & Not implemented & FIFO admission queue & Priority queues / toil budgets \\ \midrule -\textbf{Evaluation Corpus} & 15{,}718 detections (rules+guardrails: 13{,}338/13{,}373 patched = 99.74\%; auto-fix 0.8486; median ops 9); 1{,}000 live-cluster manifests (100.0\% success); 5{,}000 Grok manifests (88.78\%); 1{,}264 supported manifests (100.00\% rules) & 200 curated manifests (85--92\% accuracy) & Thousands of user manifests (80--95\% mutation acceptance) & Millions of production workloads (no public acceptance \%) \\ +\textbf{Evaluation Corpus} & 15{,}718 detections (rules+guardrails: 13{,}589/13{,}656 patched = 99.51\%; auto-fix 0.8646; median ops 8); 1{,}000 live-cluster manifests (100.0\% success); 5{,}000 Grok manifests (88.52\%); 1{,}264 supported manifests (100.00\% rules) & 200 curated manifests (85--92\% accuracy) & Thousands of user manifests (80--95\% mutation acceptance) & Millions of production workloads (no public acceptance \%) \\ \midrule \textbf{Telemetry} & Policy-level success probabilities, latency histograms, failure taxonomy & Token/cost estimates; no pipeline telemetry & Admission latency $<45$~ms, violation counts & MTTR, incident counts, operator feedback \\ \midrule @@ -325,7 +325,7 @@ \subsection{End-to-End Walkthrough on Real Manifests} \subsection{Research Questions and Findings} \begin{enumerate} - \item[\textbf{RQ1}] \textbf{Robustness:} The closed 
loop delivers 88.78\% acceptance on the Grok-5k sweep, 100.00\% on the supported 1,264-manifest corpus in rules mode, and 13{,}338/13{,}373 (99.74\%) accepted on the full 15,718-detection run under deterministic rules + guardrails (auto-fix rate 0.8486 over detections; median ops 9), with no hostPath-related safety failures remaining. + \item[\textbf{RQ1}] \textbf{Robustness:} The closed loop delivers 88.52\% acceptance on the Grok-5k sweep, 100.00\% on the supported 1,264-manifest corpus in rules mode, and 13{,}589/13{,}656 (99.51\%) accepted on the full 15,718-detection run under deterministic rules + guardrails (auto-fix rate 0.8646 over detections; median ops 8), with no hostPath-related safety failures remaining. \item[\textbf{RQ2}] \textbf{Scheduling Effectiveness:} The bandit ($R p / \mathbb{E}[t]$ + aging + KEV boost) improves risk reduction per hour and reduces top-risk P95 wait from 102.3~hours (FIFO) to 13.0~hours ($7.9\times$). \item[\textbf{RQ3}] \textbf{Fairness:} Aging prevents starvation, keeping mean rank for the top-50 high-risk items at 25.5 while still progressing lower-risk items. \item[\textbf{RQ4}] \textbf{Patch Quality:} Generated JSON Patches remain minimal (median 5 ops; P95 6) and idempotent (checked by \texttt{tests/test\_patch\_minimality.py}).
@@ -669,7 +669,7 @@ \subsection{Evaluation Results} Supported (rules, 1{,}264) & 1337 & 1264/1264 (100.00\%) & 29.0 & 242.0 & 517.8 \\ -Full corpus (rules+guardrails, 15{,}718 detections) & 1337 & 13338/13373 (99.74\%; auto-fix 0.8486 over detections) & -- & -- & -- \\ +Full corpus (rules+guardrails, 15{,}718 detections) & 1337 & 13589/13656 (99.51\%; auto-fix 0.8646 over detections) & -- & -- & -- \\ Manifest slice (Grok/xAI, 1{,}313) & 1337 & 1313/1313 (100.00\%) & 5095.5 & 138.4 & 904.6 \\ -Grok-5k (Grok/xAI) & 1337 & 4439/5000 (88.78\%) & 5095.5 & 138.4 & 904.6 \\ +Grok-5k (Grok/xAI) & 1337 & 4426/5000 (88.52\%) & 5095.5 & 138.4 & 904.6 \\ \bottomrule \end{tabularx}} \endgroup @@ -898,7 +898,7 @@ \section{Limitations and Mitigations} \end{itemize} \section{Discussion and Future Work} -The current pipeline achieves 100.0\% live-cluster success (1,000/1,000 stratified manifests) with perfect dry-run/live-apply alignment and surpasses academic baselines (Table~\ref{tab:eval_summary}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/live_cluster/results_1k.json}{data/live\_cluster/results\_1k.json}). Across offline corpora, the system delivers 93.54\% acceptance on the 5k supported corpus, 100.00\% on the 1,264-manifest supported slice, 100.00\% on the 1,313-manifest Grok/xAI run, and 88.78\% on Grok-5k overall, while deterministic rules + guardrails now accept 13{,}338 / 13{,}373 patched items (99.74\%; auto-fix rate 0.8486 over 15,718 detections) with median patch ops 9 (Table~\ref{tab:eval_summary}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/metrics_latest.json}{data/metrics\_latest.json}). The risk-aware scheduler trims top-risk P95 wait times from 102.3\,h (FIFO) to 13.0\,h (\href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/scheduler/metrics_sweep_live.json}{data/scheduler/metrics\_sweep\_live.json}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/outputs/scheduler/metrics_schedule_sweep.json}{data/outputs/scheduler/metrics\_schedule\_sweep.json}).
+The current pipeline achieves 100.0\% live-cluster success (1,000/1,000 stratified manifests) with perfect dry-run/live-apply alignment and surpasses academic baselines (Table~\ref{tab:eval_summary}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/live_cluster/results_1k.json}{data/live\_cluster/results\_1k.json}). Across offline corpora, the system delivers 93.54\% acceptance on the 5k supported corpus, 100.00\% on the 1,264-manifest supported slice, 100.00\% on the 1,313-manifest Grok/xAI run, and 88.52\% on Grok-5k overall, while deterministic rules + guardrails now accept 13{,}589 / 13{,}656 patched items (99.51\%; auto-fix rate 0.8646 over 15,718 detections) with median patch ops 8 (Table~\ref{tab:eval_summary}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/metrics_latest.json}{data/metrics\_latest.json}). The risk-aware scheduler trims top-risk P95 wait times from 102.3\,h (FIFO) to 13.0\,h (\href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/scheduler/metrics_sweep_live.json}{data/scheduler/metrics\_sweep\_live.json}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/outputs/scheduler/metrics_schedule_sweep.json}{data/outputs/scheduler/metrics\_schedule\_sweep.json}). All metrics in this paper are regenerated from the public artifact bundle (\texttt{make reproducible-report}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/ARTIFACTS.md}{ARTIFACTS.md}), and the scheduler comparisons we report stem from deterministic queue replays rather than live analyst rotations. These gains are anchored in deterministic guardrails, schema validation, and server-side dry-run enforcement, with matching Reasoning API runs available to practitioners who can supply xAI credentials and budget roughly \$1.22 per 5k sweep under the published pricing (\href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/grok5k_telemetry.json}{data/grok5k\_telemetry.json}, \cite{xai_pricing}). 
To prevent configuration drift, every accepted patch is surfaced as a pull request through our GitOps helper (\href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/scripts/gitops_writeback.py}{scripts/gitops\_writeback.py}), which records verifier evidence, captures the JSON Patch diff, and requires human approval before merge, mirroring the workflow detailed in \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/docs/GITOPS.md}{docs/GITOPS.md}. diff --git a/src/eval/metrics.py b/src/eval/metrics.py index 0ef5628a..060ee93c 100644 --- a/src/eval/metrics.py +++ b/src/eval/metrics.py @@ -1,5 +1,6 @@ from __future__ import annotations +import gzip import json import statistics from pathlib import Path @@ -93,10 +94,17 @@ def run( def _load_array(path: Path) -> List[Any]: try: - with path.open("r", encoding="utf-8") as f: - data = json.load(f) + if path.suffix == ".gz": + with gzip.open(path, "rt", encoding="utf-8") as f: + data = json.load(f) + else: + with path.open("r", encoding="utf-8") as f: + data = json.load(f) except FileNotFoundError: return [] + except Exception as e: + typer.echo(f"Error loading {path}: {e}", err=True) + return [] if not isinstance(data, list): return [] return data @@ -104,4 +112,3 @@ def _load_array(path: Path) -> List[Any]: if __name__ == "__main__": # pragma: no cover typer.run(run) -
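
The `_load_array` change in `src/eval/metrics.py` above makes tooling transparent to the `.json.gz` artifacts, so consumers no longer have to `gunzip` before running the metrics scripts. A minimal standalone sketch of the same pattern (the helper name `load_array` and the simplified exception handling are illustrative, not the module's exact code, which routes errors through `typer.echo`):

```python
import gzip
import json
from pathlib import Path
from typing import Any, List


def load_array(path: Path) -> List[Any]:
    """Load a JSON array from a .json or .json.gz file; return [] on failure."""
    try:
        if path.suffix == ".gz":
            # Compressed artifacts, e.g. data/patches_rules_full.json.gz
            with gzip.open(path, "rt", encoding="utf-8") as f:
                data = json.load(f)
        else:
            with path.open("r", encoding="utf-8") as f:
                data = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Missing file or malformed JSON both degrade to an empty array.
        return []
    return data if isinstance(data, list) else []


if __name__ == "__main__":
    # Round-trip a small JSON Patch array through a gzipped file.
    tmp = Path("example.json.gz")
    with gzip.open(tmp, "wt", encoding="utf-8") as f:
        json.dump([{"op": "add", "path": "/spec/hostNetwork", "value": False}], f)
    print(load_array(tmp))
    tmp.unlink()
```

Because `gzip.open(..., "rt")` yields a text stream, the same `json.load` call serves both branches; only the file opener differs.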