45 changes: 23 additions & 22 deletions README.md
@@ -1,6 +1,7 @@
# k8s-auto-fix
# Closed-Loop Threat-Guided Auto-Fixing of Kubernetes YAML Security Misconfigurations

`k8s-auto-fix` is a closed-loop pipeline that detects Kubernetes misconfigurations, proposes JSON patches, verifies them against guardrails, and schedules accepted fixes. It supports deterministic rules as well as Grok and OpenAI-compatible LLM modes, and underpins the accompanying research paper.
## Abstract
Containerized apps depend on short YAML files that wire images, permissions, and storage together; small typos or copy/paste errors can create outages or security gaps. Most tools only flag the problems, leaving people to guess at safe fixes. We built `k8s-auto-fix` to close the loop: detect an issue, suggest a small patch, verify it safely, and line it up for application. On a 1,000-manifest replay against a real cluster we fixed every item without rollbacks (1,000/1,000). On a 15,718-detection offline run, deterministic rules plus safety checks accepted 13,589 of 13,656 patched items (99.51%; auto-fix rate 0.8646; median patch length 8). An optional LLM mode reaches 88.52% acceptance on a 5,000-manifest corpus. A simple risk-aware scheduler also cuts the worst-case wait for high-risk items by 7.9×. We release all data and scripts so others can reproduce these results.
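The closed loop ends in a machine-applicable edit. As a sketch (the patch target and value here are illustrative, not drawn from the pipeline's actual rule set), a minimal RFC 6902-style JSON Patch that disables privileged mode on a container can be applied like this:

```python
# Minimal RFC 6902 "add"/"replace"/"remove" applier -- illustrative only;
# the real pipeline emits full JSON Patches and verifies them with guardrails.

def apply_patch(doc, ops):
    for op in ops:
        parts = [p for p in op["path"].split("/") if p]
        target = doc
        for part in parts[:-1]:
            key = int(part) if isinstance(target, list) else part
            target = target[key]
        last = parts[-1]
        key = int(last) if isinstance(target, list) else last
        if op["op"] in ("add", "replace"):
            target[key] = op["value"]
        elif op["op"] == "remove":
            del target[key]
    return doc

manifest = {
    "spec": {"containers": [{"name": "app",
                             "securityContext": {"privileged": True}}]}
}
patch = [  # hypothetical fix: drop privileged mode
    {"op": "replace",
     "path": "/spec/containers/0/securityContext/privileged",
     "value": False},
]
fixed = apply_patch(manifest, patch)
print(fixed["spec"]["containers"][0]["securityContext"]["privileged"])  # False
```

The small, structured edit is what makes verification tractable: a short patch can be re-checked against policy, schema, and a server-side dry run before anything touches the cluster.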

## Key features
- End-to-end detector -> proposer -> verifier -> risk -> scheduler -> queue workflow with reproducible CLI entry points.
@@ -40,15 +41,18 @@ Benchmark helpers (`make benchmark-grok200`, `make benchmark-full`, `make benchm
- `archives/` – historical exports and large bundles kept out of the active workspace.
- `configs/` – pipeline presets (`run.yaml`, `run_grok.yaml`, `run_rules.yaml`).
- `data/` – retains the canonical folders (`data/manifests`, `data/batch_runs`, etc.) and now exposes curated views via `data/corpora/` (inputs) and `data/outputs/` (generated artefacts). See `data/README.md` for details.
- `docker/` – container definition for a reproducible environment.
- `docs/` – research notes, policy guidance, reproducibility appendices, future work plans.
- `infra/fixtures/` – RBAC, NetworkPolicies, and manifest samples (CronJob scanner, Bitnami PostgreSQL) for reproducing edge cases.
- `figures/` – plots and diagrams used in the research paper and README.
- `infra/` – infrastructure definitions (`fixtures/`, `crds/`).
- `logs/` – proposer/verifier transcripts, Grok sweep summaries, and root-level logs (e.g. `logs/access.log`).
- `notes/` – working notes and backlog items formerly at the repository root.
- `notes/` – working notes and backlog items.
- `paper/` – IEEE Access manuscript sources; archives in `paper/archives/` and the Overleaf export tracked under `paper/overleaf/`.
- `policies/` – baseline policy definitions (e.g. Kyverno mutating rules).
- `scripts/` – maintenance and evaluation helpers; see `scripts/README.md` for an index by pipeline stage.
- `src/` – core packages (`common`, `detector`, `proposer`, `risk`, `scheduler`, `verifier`).
- `tests/` – pytest suite validating detectors, proposer guardrails, verifier gates, scheduler scoring, and CLI tooling.
- `tmp/` – scratch workspace (ignored by git). Historic large exports remain under `archives/` if needed.
- `verification/` – literature review materials and OCR references.

## Configuration
`configs/run.yaml` centralises proposer configuration:
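The exact schema of `run.yaml` is not reproduced here; the following hypothetical preset only illustrates the kind of knobs such a file centralises (all field names are assumptions, not the file's real keys):

```yaml
# Hypothetical preset -- field names are illustrative, not the real schema.
proposer:
  mode: rules            # rules | grok | openai-compatible
  model: grok-beta       # ignored in rules mode
  api_key_env: XAI_API_KEY
  max_patch_ops: 16      # guardrail: reject oversized patches
verifier:
  dry_run: server        # kubectl apply --dry-run=server
```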
@@ -84,28 +88,26 @@ Export the appropriate API key (`XAI_API_KEY`, `OPENAI_API_KEY`, `RUNPOD_API_KEY
- `scripts/parallel_runner.py` – parallelise proposer/verifier workloads; `scripts/probe_grok_rate.py` sizes safe LLM concurrency.

## Datasets and metrics (Oct 2025 snapshot)
- **Rules baseline (full corpus)** – 13,589 / 13,656 fixes (99.5 percent) with median JSON Patch length 8 (`data/patches_rules_full.json.gz`, `data/verified_rules_full.json.gz`, `data/metrics_rules_full.json`; decompress the `.json.gz` files before consuming them).
- **Grok full corpus** – 1,313 / 1,313 accepted (100 percent) with median JSON Patch length 6 (curated view `data/outputs/batch_runs/grok_full/metrics_grok_full.json` points to the canonical `data/batch_runs/grok_full/metrics_grok_full.json`).
- **Secondary supported corpus** – 1,264 / 1,264 accepted in rules mode; artefacts and telemetry live at `data/batch_runs/secondary_supported/` with a companion symlink view under `data/outputs/batch_runs/secondary_supported/`.
- **Rules baseline (full corpus)** – 13,589 / 13,656 patched items (99.51%; auto-fix rate 0.8646 over 15,718 detections) with median JSON Patch length 8 (`data/patches_rules_full.json.gz`, `data/verified_rules_full.json.gz`, `data/metrics_rules_full.json`; decompress the `.json.gz` files before consuming them).
- **Grok manifest slice** – 1,313 / 1,313 accepted (100.00%); the curated view `data/outputs/batch_runs/grok_full/metrics_grok_full.json` points to the canonical `data/batch_runs/grok_full/metrics_grok_full.json`.
- **Grok 5k corpus** – 4,426 / 5,000 accepted (88.52%; see `data/batch_runs/grok_5k/metrics_grok5k.json`).
- **Secondary supported corpus** – 1,264 / 1,264 accepted (100.00%) in rules mode; artefacts and telemetry live at `data/batch_runs/secondary_supported/` with a companion symlink view under `data/outputs/batch_runs/secondary_supported/`.
- Policy-level success probabilities and runtimes are regenerated via `scripts/compute_policy_metrics.py` into `data/policy_metrics.json`.
- Scheduler evaluation (`docs/scheduler_visualisation.md`, viewable at `data/outputs/scheduler/metrics_schedule_sweep.json`) compares bandit, risk-only, and FIFO strategies.

Large corpus artefacts now live under `data/outputs/` and are stored as compressed `.json.gz` files to keep the repository lean. Run `gunzip data/patches_rules_full.json.gz` (and the verified counterpart) before tooling that expects the plain `.json` filenames.
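The two headline numbers above use different denominators, which is easy to conflate: acceptance is accepted over patched items, while the auto-fix rate is accepted over all detections. A quick reconciliation against the reported counts:

```python
# Reconcile the two headline metrics from data/metrics_rules_full.json.
detections = 15718   # every issue the detector flagged
patched    = 13656   # detections for which a patch was proposed
accepted   = 13589   # patches that passed all verifier gates

acceptance_rate = accepted / patched      # 13589 / 13656
auto_fix_rate   = accepted / detections   # 13589 / 15718

print(f"acceptance {acceptance_rate:.4f}, auto-fix {auto_fix_rate:.4f}")
# acceptance 0.9951, auto-fix 0.8646
```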

## Roadmap
- Q4 2025 - publish a containerised reproducibility bundle for one-command replays.
- Q1 2026 - rerun Grok corpora with live latency/token telemetry.
- Q1 2026 - validate against an external CNCF corpus.
- Q2 2026 - expand operator studies and incorporate threat-mitigation guard metadata into CI.
## Comparison of automated Kubernetes remediation systems

## Related work
| System | Acceptance / fix rate | Corpus | Guardrail highlights | Scheduling |
| ------ | -------------------- | ------ | ------------------- | ---------- |
| **k8s-auto-fix** | 88.78% (Grok-5k), 93.54% / 100% (supported rules), 100% (Grok 1.313k) | 5k + 1.3k manifests | Secret sanitisation, privileged DaemonSet hardening, CRD seeding, triad verification | Bandit scheduler with policy metrics |
| GenKubeSec (2024) | ~85-92% (curated 200) | 200 manifests | LLM reasoning with human review | None |
| Kyverno (2023+) | 80-95% (policy mutation) | Thousands | Policy-driven mutation/generation | Admission queue |

Note: Production SRE automation systems (e.g., Google Borg) discuss automation principles publicly, but we do not cite a public acceptance percentage and therefore avoid drawing numeric comparisons.
| Capability | k8s-auto-fix (this work) | GenKubeSec (2024) | Kyverno (2023+) | Borg/SRE (2015+) |
| :--- | :--- | :--- | :--- | :--- |
| **Primary Goal** | Closed-loop hardening (detect→patch→verify→prioritize) | LLM-based detection/remediation suggestions | Admission-time policy enforcement | Large-scale auto-remediation in production clusters |
| **Fix Mode** | JSON Patch (rules + optional LLM) | LLM-generated YAML edits | Policy mutation/generation | Custom controllers and playbooks |
| **Guardrails** | Policy re-check + schema + `kubectl apply --dry-run=server` + privileged/secret sanitization + CRD seeding | Manual review; no automated gates | Validation/mutation webhooks; assumes controllers | Health checks, automated rollback, throttling |
| **Risk Prioritization** | Bandit ($R p / \mathbb{E}[t]$ + aging + KEV boost) | Not implemented | FIFO admission queue | Priority queues / toil budgets |
| **Evaluation Corpus** | 15,718 detections (rules+guardrails: 13,589/13,656 patched = 99.51%; auto-fix 0.8646); 1,000 live-cluster manifests (100.0% success); 5,000 Grok manifests (88.52%) | 200 curated manifests (85–92% accuracy) | Thousands of user manifests (80–95% mutation acceptance) | Millions of production workloads (no public acceptance %) |
| **Telemetry** | Policy-level success probabilities, latency histograms, failure taxonomy | Token/cost estimates; no pipeline telemetry | Admission latency <45 ms, violation counts | MTTR, incident counts, operator feedback |
| **Outstanding Gaps** | Infrastructure-dependent rejects, operator study, scheduled guidance refresh in CI | Automated guardrails, risk-aware ordering | LLM-aware patching, risk-aware scheduling | Declarative manifest fixes, static analysis integration |
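The bandit column's priority expression reads as expected risk reduction per unit time. A sketch under assumed semantics (the weighting constants and parameter names are illustrative, not the scheduler's actual code):

```python
# Illustrative risk-aware priority: R * p / E[t], plus an aging term and a
# KEV (Known Exploited Vulnerabilities) boost. All constants are assumptions.

def priority(risk, success_prob, expected_secs, wait_secs=0.0, kev=False,
             aging_per_min=0.01, kev_bonus=2.0):
    score = risk * success_prob / expected_secs  # expected payoff per second
    score += aging_per_min * (wait_secs / 60.0)  # starvation guard
    if kev:
        score += kev_bonus                       # jump actively exploited items
    return score

# A KEV-flagged item outranks a cheap low-risk one despite a longer fix time.
hot = priority(risk=9.0, success_prob=0.90, expected_secs=30, kev=True)
cold = priority(risk=2.0, success_prob=0.99, expected_secs=5)
print(hot > cold)  # True
```

The aging term keeps low-risk items from starving indefinitely, which is the fairness property the 7.9× worst-case-wait improvement in the abstract refers to.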

## Baselines and Reproducibility

@@ -122,4 +124,3 @@ scripts/reproduce_all.sh
```

See `ARTIFACTS.md` for artifact map, `docs/VERIFIER.md` for guardrails, `docs/BASELINES.md` to run baselines, `docs/RISK_EVAL.md` for prioritization metrics, and `docs/LIVE_EVAL.md` for live-cluster methodology.
| Magpie (2024) | ~84% dry-run acceptance | 9.5k manifests | RBAC and PSP static analysis | None |
2 changes: 1 addition & 1 deletion data/batch_runs/grok_5k/metrics_grok5k.json
@@ -14,4 +14,4 @@
"completion_tokens": 689779.0,
"total_tokens": 11399926.0
}
}
}
6 changes: 3 additions & 3 deletions data/eval/unified_eval_summary.json
@@ -52,13 +52,13 @@
}
},
{
"dataset": "Manifest 1.313k",
"dataset": "Full Corpus (Rules)",
"mode": "rules",
"seed": 1337,
"note": "Full manifest slice in deterministic rules mode.",
"total": 13656,
"total": 15718,
"accepted": 13589,
"acceptance_rate": 0.9951,
"acceptance_rate": 0.8646,
"median_patch_ops": 8,
"proposer_latency_ms": {
"count": 0,
6 changes: 3 additions & 3 deletions data/metrics_rules_full.json
@@ -1,9 +1,9 @@
{
"detections": 13656,
"detections": 15718,
"patches": 13656,
"verified": 13656,
"accepted": 13589,
"auto_fix_rate": 0.9951,
"auto_fix_rate": 0.8646,
"median_patch_ops": 8,
"failed_policy": 63,
"failed_schema": 63,
@@ -14,4 +14,4 @@
"completion_tokens": 0.0,
"total_tokens": 0.0
}
}
}
Binary file modified figures/admission_vs_posthoc.png
Binary file modified figures/fairness_waits.png
Binary file modified figures/mode_comparison.png
Binary file modified figures/operator_ab.png