45 changes: 23 additions & 22 deletions README.md
@@ -1,6 +1,7 @@
# k8s-auto-fix
# Closed-Loop Threat-Guided Auto-Fixing of Kubernetes YAML Security Misconfigurations

`k8s-auto-fix` is a closed-loop pipeline that detects Kubernetes misconfigurations, proposes JSON patches, verifies them against guardrails, and schedules accepted fixes. It supports deterministic rules as well as Grok and OpenAI-compatible LLM modes, and underpins the accompanying research paper.
## Abstract
Containerized apps depend on short YAML files that wire images, permissions, and storage together; small typos or copy/paste errors can create outages or security gaps. Most tools only flag the problems, leaving people to guess at safe fixes. We built `k8s-auto-fix` to close the loop: detect an issue, suggest a small patch, verify it safely, and line it up for application. On a 1,000-manifest replay against a real cluster we fixed every item without rollbacks (1,000/1,000). On a 15,718-detection offline run, deterministic rules plus safety checks accepted 13,589 of 13,656 patched items (99.51%; auto-fix rate 0.8646; median patch length 8). An optional LLM mode reaches 88.52% acceptance on a 5,000-manifest corpus. A simple risk-aware scheduler also cuts the worst-case wait for high-risk items by 7.9×. We release all data and scripts so others can reproduce these results.
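The closed loop ends in a machine-applicable edit. As a sketch (the patch target and value here are illustrative, not drawn from the pipeline's actual rule set), a minimal RFC 6902-style JSON Patch that disables privileged mode on a container can be applied like this:

```python
# Minimal RFC 6902 "add"/"replace"/"remove" applier -- illustrative only;
# the real pipeline emits full JSON Patches and verifies them with guardrails.

def apply_patch(doc, ops):
    for op in ops:
        parts = [p for p in op["path"].split("/") if p]
        target = doc
        for part in parts[:-1]:
            key = int(part) if isinstance(target, list) else part
            target = target[key]
        last = parts[-1]
        key = int(last) if isinstance(target, list) else last
        if op["op"] in ("add", "replace"):
            target[key] = op["value"]
        elif op["op"] == "remove":
            del target[key]
    return doc

manifest = {
    "spec": {"containers": [{"name": "app",
                             "securityContext": {"privileged": True}}]}
}
patch = [  # hypothetical fix: drop privileged mode
    {"op": "replace",
     "path": "/spec/containers/0/securityContext/privileged",
     "value": False},
]
fixed = apply_patch(manifest, patch)
print(fixed["spec"]["containers"][0]["securityContext"]["privileged"])  # False
```

The small, structured edit is what makes verification tractable: a short patch can be re-checked against policy, schema, and a server-side dry run before anything touches the cluster.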

## Key features
- End-to-end detector -> proposer -> verifier -> risk -> scheduler -> queue workflow with reproducible CLI entry points.
@@ -40,15 +41,18 @@ Benchmark helpers (`make benchmark-grok200`, `make benchmark-full`, `make benchm
- `archives/` – historical exports and large bundles kept out of the active workspace.
- `configs/` – pipeline presets (`run.yaml`, `run_grok.yaml`, `run_rules.yaml`).
- `data/` – retains the canonical folders (`data/manifests`, `data/batch_runs`, etc.) and now exposes curated views via `data/corpora/` (inputs) and `data/outputs/` (generated artefacts). See `data/README.md` for details.
- `docker/` – container definition for a reproducible environment.
- `docs/` – research notes, policy guidance, reproducibility appendices, future work plans.
- `infra/fixtures/` – RBAC, NetworkPolicies, and manifest samples (CronJob scanner, Bitnami PostgreSQL) for reproducing edge cases.
- `figures/` – plots and diagrams used in the research paper and README.
- `infra/` – infrastructure definitions (`fixtures/`, `crds/`).
- `logs/` – proposer/verifier transcripts, Grok sweep summaries, and root-level logs (e.g. `logs/access.log`).
- `notes/` – working notes and backlog items formerly at the repository root.
- `notes/` – working notes and backlog items.
- `paper/` – IEEE Access manuscript sources; archives in `paper/archives/` and the Overleaf export tracked under `paper/overleaf/`.
- `policies/` – baseline policy definitions (e.g. Kyverno mutating rules).
- `scripts/` – maintenance and evaluation helpers; see `scripts/README.md` for an index by pipeline stage.
- `src/` – core packages (`common`, `detector`, `proposer`, `risk`, `scheduler`, `verifier`).
- `tests/` – pytest suite validating detectors, proposer guardrails, verifier gates, scheduler scoring, and CLI tooling.
- `tmp/` – scratch workspace (ignored by git). Historic large exports remain under `archives/` if needed.
- `verification/` – literature review materials and OCR references.

## Configuration
`configs/run.yaml` centralises proposer configuration:
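The exact schema of `run.yaml` is not reproduced here; the following hypothetical preset only illustrates the kind of knobs such a file centralises (all field names are assumptions, not the file's real keys):

```yaml
# Hypothetical preset -- field names are illustrative, not the real schema.
proposer:
  mode: rules            # rules | grok | openai-compatible
  model: grok-beta       # ignored in rules mode
  api_key_env: XAI_API_KEY
  max_patch_ops: 16      # guardrail: reject oversized patches
verifier:
  dry_run: server        # kubectl apply --dry-run=server
```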
@@ -84,28 +88,26 @@ Export the appropriate API key (`XAI_API_KEY`, `OPENAI_API_KEY`, `RUNPOD_API_KEY
- `scripts/parallel_runner.py` – parallelise proposer/verifier workloads; `scripts/probe_grok_rate.py` sizes safe LLM concurrency.

## Datasets and metrics (Oct 2025 snapshot)
- **Rules baseline (full corpus)** – 13,589 / 13,656 fixes (99.5 percent) with median JSON Patch length 8 (`data/patches_rules_full.json.gz`, `data/verified_rules_full.json.gz`, `data/metrics_rules_full.json`; decompress the `.json.gz` files before consuming them).
- **Grok full corpus** – 1,313 / 1,313 accepted (100 percent) with median JSON Patch length 6 (curated view `data/outputs/batch_runs/grok_full/metrics_grok_full.json` points to the canonical `data/batch_runs/grok_full/metrics_grok_full.json`).
- **Secondary supported corpus** – 1,264 / 1,264 accepted in rules mode; artefacts and telemetry live at `data/batch_runs/secondary_supported/` with a companion symlink view under `data/outputs/batch_runs/secondary_supported/`.
- **Rules baseline (full corpus)** – 13,589 / 13,656 patched items (99.51%; auto-fix rate 0.8646 over 15,718 detections) with median JSON Patch length 8 (`data/patches_rules_full.json.gz`, `data/verified_rules_full.json.gz`, `data/metrics_rules_full.json`; decompress the `.json.gz` files before consuming them).
- **Grok manifest slice** – 1,313 / 1,313 accepted (100.00%); the curated view `data/outputs/batch_runs/grok_full/metrics_grok_full.json` points to the canonical `data/batch_runs/grok_full/metrics_grok_full.json`.
- **Grok 5k corpus** – 4,426 / 5,000 accepted (88.52%; see `data/batch_runs/grok_5k/metrics_grok5k.json`).
- **Secondary supported corpus** – 1,264 / 1,264 accepted (100.00%) in rules mode; artefacts and telemetry live at `data/batch_runs/secondary_supported/` with a companion symlink view under `data/outputs/batch_runs/secondary_supported/`.
- Policy-level success probabilities and runtimes are regenerated via `scripts/compute_policy_metrics.py` into `data/policy_metrics.json`.
- Scheduler evaluation (`docs/scheduler_visualisation.md`, viewable at `data/outputs/scheduler/metrics_schedule_sweep.json`) compares bandit, risk-only, and FIFO strategies.

Large corpus artefacts now live under `data/outputs/` and are stored as compressed `.json.gz` files to keep the repository lean. Run `gunzip data/patches_rules_full.json.gz` (and the verified counterpart) before tooling that expects the plain `.json` filenames.
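The two headline numbers above use different denominators, which is easy to conflate: acceptance is accepted over patched items, while the auto-fix rate is accepted over all detections. A quick reconciliation against the reported counts:

```python
# Reconcile the two headline metrics from data/metrics_rules_full.json.
detections = 15718   # every issue the detector flagged
patched    = 13656   # detections for which a patch was proposed
accepted   = 13589   # patches that passed all verifier gates

acceptance_rate = accepted / patched      # 13589 / 13656
auto_fix_rate   = accepted / detections   # 13589 / 15718

print(f"acceptance {acceptance_rate:.4f}, auto-fix {auto_fix_rate:.4f}")
# acceptance 0.9951, auto-fix 0.8646
```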

## Roadmap
- Q4 2025 - publish a containerised reproducibility bundle for one-command replays.
- Q1 2026 - rerun Grok corpora with live latency/token telemetry.
- Q1 2026 - validate against an external CNCF corpus.
- Q2 2026 - expand operator studies and incorporate threat-mitigation guard metadata into CI.
## Comparison of automated Kubernetes remediation systems

## Related work
| System | Acceptance / fix rate | Corpus | Guardrail highlights | Scheduling |
| ------ | -------------------- | ------ | ------------------- | ---------- |
| **k8s-auto-fix** | 88.78% (Grok-5k), 93.54% / 100% (supported rules), 100% (Grok 1.313k) | 5k + 1.3k manifests | Secret sanitisation, privileged DaemonSet hardening, CRD seeding, triad verification | Bandit scheduler with policy metrics |
| GenKubeSec (2024) | ~85-92% (curated 200) | 200 manifests | LLM reasoning with human review | None |
| Kyverno (2023+) | 80-95% (policy mutation) | Thousands | Policy-driven mutation/generation | Admission queue |

Note: Production SRE automation systems (e.g., Google Borg) discuss automation principles publicly, but we do not cite a public acceptance percentage and therefore avoid drawing numeric comparisons.
| Capability | k8s-auto-fix (this work) | GenKubeSec (2024) | Kyverno (2023+) | Borg/SRE (2015+) |
| :--- | :--- | :--- | :--- | :--- |
| **Primary Goal** | Closed-loop hardening (detect→patch→verify→prioritize) | LLM-based detection/remediation suggestions | Admission-time policy enforcement | Large-scale auto-remediation in production clusters |
| **Fix Mode** | JSON Patch (rules + optional LLM) | LLM-generated YAML edits | Policy mutation/generation | Custom controllers and playbooks |
| **Guardrails** | Policy re-check + schema + `kubectl apply --dry-run=server` + privileged/secret sanitization + CRD seeding | Manual review; no automated gates | Validation/mutation webhooks; assumes controllers | Health checks, automated rollback, throttling |
| **Risk Prioritization** | Bandit ($R p / \mathbb{E}[t]$ + aging + KEV boost) | Not implemented | FIFO admission queue | Priority queues / toil budgets |
| **Evaluation Corpus** | 15,718 detections (rules+guardrails: 13,589/13,656 patched = 99.51%; auto-fix 0.8646); 1,000 live-cluster manifests (100.0% success); 5,000 Grok manifests (88.52%) | 200 curated manifests (85–92% accuracy) | Thousands of user manifests (80–95% mutation acceptance) | Millions of production workloads (no public acceptance %) |
| **Telemetry** | Policy-level success probabilities, latency histograms, failure taxonomy | Token/cost estimates; no pipeline telemetry | Admission latency <45 ms, violation counts | MTTR, incident counts, operator feedback |
| **Outstanding Gaps** | Infrastructure-dependent rejects, operator study, scheduled guidance refresh in CI | Automated guardrails, risk-aware ordering | LLM-aware patching, risk-aware scheduling | Declarative manifest fixes, static analysis integration |
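The bandit column's priority expression reads as expected risk reduction per unit time. A sketch under assumed semantics (the weighting constants and parameter names are illustrative, not the scheduler's actual code):

```python
# Illustrative risk-aware priority: R * p / E[t], plus an aging term and a
# KEV (Known Exploited Vulnerabilities) boost. All constants are assumptions.

def priority(risk, success_prob, expected_secs, wait_secs=0.0, kev=False,
             aging_per_min=0.01, kev_bonus=2.0):
    score = risk * success_prob / expected_secs  # expected payoff per second
    score += aging_per_min * (wait_secs / 60.0)  # starvation guard
    if kev:
        score += kev_bonus                       # jump actively exploited items
    return score

# A KEV-flagged item outranks a cheap low-risk one despite a longer fix time.
hot = priority(risk=9.0, success_prob=0.90, expected_secs=30, kev=True)
cold = priority(risk=2.0, success_prob=0.99, expected_secs=5)
print(hot > cold)  # True
```

The aging term keeps low-risk items from starving indefinitely, which is the fairness property the 7.9× worst-case-wait improvement in the abstract refers to.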

## Baselines and Reproducibility

@@ -122,4 +124,3 @@ scripts/reproduce_all.sh
```

See `ARTIFACTS.md` for artifact map, `docs/VERIFIER.md` for guardrails, `docs/BASELINES.md` to run baselines, `docs/RISK_EVAL.md` for prioritization metrics, and `docs/LIVE_EVAL.md` for live-cluster methodology.
| Magpie (2024) | ~84% dry-run acceptance | 9.5k manifests | RBAC and PSP static analysis | None |
2 changes: 1 addition & 1 deletion data/batch_runs/grok_5k/metrics_grok5k.json
@@ -14,4 +14,4 @@
"completion_tokens": 689779.0,
"total_tokens": 11399926.0
}
}
}
6 changes: 3 additions & 3 deletions data/eval/unified_eval_summary.json
@@ -52,13 +52,13 @@
}
},
{
"dataset": "Manifest 1.313k",
"dataset": "Full Corpus (Rules)",
"mode": "rules",
"seed": 1337,
"note": "Full manifest slice in deterministic rules mode.",
"total": 13656,
"total": 15718,
"accepted": 13589,
"acceptance_rate": 0.9951,
"acceptance_rate": 0.8646,
"median_patch_ops": 8,
"proposer_latency_ms": {
"count": 0,
6 changes: 3 additions & 3 deletions data/metrics_rules_full.json
@@ -1,9 +1,9 @@
{
"detections": 13656,
"detections": 15718,
"patches": 13656,
"verified": 13656,
"accepted": 13589,
"auto_fix_rate": 0.9951,
"auto_fix_rate": 0.8646,
"median_patch_ops": 8,
"failed_policy": 63,
"failed_schema": 63,
@@ -14,4 +14,4 @@
"completion_tokens": 0.0,
"total_tokens": 0.0
}
}
}
Binary file modified figures/admission_vs_posthoc.png
Binary file modified figures/fairness_waits.png
Binary file modified figures/mode_comparison.png
Binary file modified figures/operator_ab.png