diff --git a/README.md b/README.md index d3b36c26..9ec65b8b 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,7 @@ -# k8s-auto-fix +# Closed-Loop Threat-Guided Auto-Fixing of Kubernetes YAML Security Misconfigurations -`k8s-auto-fix` is a closed-loop pipeline that detects Kubernetes misconfigurations, proposes JSON patches, verifies them against guardrails, and schedules accepted fixes. It supports deterministic rules as well as Grok and OpenAI-compatible LLM modes, and underpins the accompanying research paper. +## Abstract +Containerized apps depend on short YAML files that wire images, permissions, and storage together; small typos or copy/paste errors can create outages or security gaps. Most tools only flag the problems, leaving people to guess at safe fixes. We built `k8s-auto-fix` to close the loop: detect an issue, suggest a small patch, verify it safely, and line it up for application. On a 1,000-manifest replay against a real cluster we fixed every item without rollbacks (1,000/1,000). On a 15,718-detection offline run, deterministic rules plus safety checks accepted 13,589 of 13,656 patched items (99.51%; auto-fix rate 0.8646; median patch length 8). An optional LLM mode reaches 88.52% acceptance on a 5,000-manifest corpus. A simple risk-aware scheduler also cuts the worst-case wait for high-risk items by 7.9×. We release all data and scripts so others can reproduce these results. ## Key features - End-to-end detector -> proposer -> verifier -> risk -> scheduler -> queue workflow with reproducible CLI entry points. @@ -40,15 +41,18 @@ Benchmark helpers (`make benchmark-grok200`, `make benchmark-full`, `make benchm - `archives/` – historical exports and large bundles kept out of the active workspace. - `configs/` – pipeline presets (`run.yaml`, `run_grok.yaml`, `run_rules.yaml`). - `data/` – retains the canonical folders (`data/manifests`, `data/batch_runs`, etc.) 
and now exposes curated views via `data/corpora/` (inputs) and `data/outputs/` (generated artefacts). See `data/README.md` for details. +- `docker/` – container definition for a reproducible environment. - `docs/` – research notes, policy guidance, reproducibility appendices, future work plans. -- `infra/fixtures/` – RBAC, NetworkPolicies, and manifest samples (CronJob scanner, Bitnami PostgreSQL) for reproducing edge cases. +- `figures/` – plots and diagrams used in the research paper and README. +- `infra/` – infrastructure definitions (`fixtures/`, `crds/`). - `logs/` – proposer/verifier transcripts, Grok sweep summaries, and root-level logs (e.g. `logs/access.log`). -- `notes/` – working notes and backlog items formerly at the repository root. +- `notes/` – working notes and backlog items. - `paper/` – IEEE Access manuscript sources; archives in `paper/archives/` and the Overleaf export tracked under `paper/overleaf/`. +- `policies/` – baseline policy definitions (e.g. Kyverno mutating rules). - `scripts/` – maintenance and evaluation helpers; see `scripts/README.md` for an index by pipeline stage. - `src/` – core packages (`common`, `detector`, `proposer`, `risk`, `scheduler`, `verifier`). - `tests/` – pytest suite validating detectors, proposer guardrails, verifier gates, scheduler scoring, CLI tooling. -- `tmp/` – scratch workspace (ignored by git). Historic large exports remain under `archives/` if needed. +- `verification/` – literature review materials and OCR references. ## Configuration `configs/run.yaml` centralises proposer configuration: @@ -84,28 +88,26 @@ Export the appropriate API key (`XAI_API_KEY`, `OPENAI_API_KEY`, `RUNPOD_API_KEY - `scripts/parallel_runner.py` - parallelise proposer/verifier workloads; `scripts/probe_grok_rate.py` sizes safe LLM concurrency.
## Datasets and metrics (Oct 2025 snapshot) -- **Rules baseline (full corpus)** – 13,589 / 13,656 fixes (99.5 percent) with median JSON Patch length 8 (`data/patches_rules_full.json.gz`, `data/verified_rules_full.json.gz`, `data/metrics_rules_full.json`; decompress the `.json.gz` files before consuming them). -- **Grok full corpus** – 1,313 / 1,313 accepted (100 percent) with median JSON Patch length 6 (curated view `data/outputs/batch_runs/grok_full/metrics_grok_full.json` points to the canonical `data/batch_runs/grok_full/metrics_grok_full.json`). -- **Secondary supported corpus** – 1,264 / 1,264 accepted in rules mode; artefacts and telemetry live at `data/batch_runs/secondary_supported/` with a companion symlink view under `data/outputs/batch_runs/secondary_supported/`. +- **Rules baseline (full corpus)** – 13,589 / 13,656 patched items (99.51%; auto-fix rate 0.8646 over 15,718 detections) with median JSON Patch length 8 (`data/patches_rules_full.json.gz`, `data/verified_rules_full.json.gz`, `data/metrics_rules_full.json`; decompress the `.json.gz` files before consuming them). +- **Grok manifest slice** – 1,313 / 1,313 accepted (100.00%; curated view `data/outputs/batch_runs/grok_full/metrics_grok_full.json` points to the canonical `data/batch_runs/grok_full/metrics_grok_full.json`). +- **Grok 5k corpus** – 4,426 / 5,000 accepted (88.52%; see `data/batch_runs/grok_5k/metrics_grok5k.json`). +- **Secondary supported corpus** – 1,264 / 1,264 accepted (100.00%) in rules mode; artefacts and telemetry live at `data/batch_runs/secondary_supported/` with a companion symlink view under `data/outputs/batch_runs/secondary_supported/`. - Policy-level success probabilities and runtimes are regenerated via `scripts/compute_policy_metrics.py` into `data/policy_metrics.json`. - Scheduler evaluation (`docs/scheduler_visualisation.md`, viewable at `data/outputs/scheduler/metrics_schedule_sweep.json`) compares bandit, risk-only, and FIFO strategies.
Large corpus artefacts now live under `data/outputs/` and are stored as compressed `.json.gz` files to keep the repository lean. Run `gunzip data/patches_rules_full.json.gz` (and the verified counterpart) before tooling that expects the plain `.json` filenames. -## Roadmap -- Q4 2025 - publish a containerised reproducibility bundle for one-command replays. -- Q1 2026 - rerun Grok corpora with live latency/token telemetry. -- Q1 2026 - validate against an external CNCF corpus. -- Q2 2026 - expand operator studies and incorporate threat-mitigation guard metadata into CI. +## Comparison of automated Kubernetes remediation systems -## Related work -| System | Acceptance / fix rate | Corpus | Guardrail highlights | Scheduling | -| ------ | -------------------- | ------ | ------------------- | ---------- | -| **k8s-auto-fix** | 88.78% (Grok-5k), 93.54% / 100% (supported rules), 100% (Grok 1.313k) | 5k + 1.3k manifests | Secret sanitisation, privileged DaemonSet hardening, CRD seeding, triad verification | Bandit scheduler with policy metrics | -| GenKubeSec (2024) | ~85-92% (curated 200) | 200 manifests | LLM reasoning with human review | None | -| Kyverno (2023+) | 80-95% (policy mutation) | Thousands | Policy-driven mutation/generation | Admission queue | - -Note: Production SRE automation systems (e.g., Google Borg) discuss automation principles publicly, but we do not cite a public acceptance percentage and therefore avoid drawing numeric comparisons. 
+| Capability | k8s-auto-fix (this work) | GenKubeSec (2024) | Kyverno (2023+) | Borg/SRE (2015+) | +| :--- | :--- | :--- | :--- | :--- | +| **Primary Goal** | Closed-loop hardening (detect→patch→verify→prioritize) | LLM-based detection/remediation suggestions | Admission-time policy enforcement | Large-scale auto-remediation in production clusters | +| **Fix Mode** | JSON Patch (rules + optional LLM) | LLM-generated YAML edits | Policy mutation/generation | Custom controllers and playbooks | +| **Guardrails** | Policy re-check + schema + `kubectl apply --dry-run=server` + privileged/secret sanitization + CRD seeding | Manual review; no automated gates | Validation/mutation webhooks; assumes controllers | Health checks, automated rollback, throttling | +| **Risk Prioritization** | Bandit ($R p / \mathbb{E}[t]$ + aging + KEV boost) | Not implemented | FIFO admission queue | Priority queues / toil budgets | +| **Evaluation Corpus** | 15,718 detections (rules+guardrails: 13,589/13,656 patched = 99.51%; auto-fix 0.8646); 1,000 live-cluster manifests (100.0% success); 5,000 Grok manifests (88.52%) | 200 curated manifests (85–92% accuracy) | Thousands of user manifests (80–95% mutation acceptance) | Millions of production workloads (no public acceptance %) | +| **Telemetry** | Policy-level success probabilities, latency histograms, failure taxonomy | Token/cost estimates; no pipeline telemetry | Admission latency <45 ms, violation counts | MTTR, incident counts, operator feedback | +| **Outstanding Gaps** | Infrastructure-dependent rejects, operator study, scheduled guidance refresh in CI | Automated guardrails, risk-aware ordering | LLM-aware patching, risk-aware scheduling | Declarative manifest fixes, static analysis integration | ## Baselines and Reproducibility @@ -122,4 +124,3 @@ scripts/reproduce_all.sh ``` See `ARTIFACTS.md` for artifact map, `docs/VERIFIER.md` for guardrails, `docs/BASELINES.md` to run baselines, `docs/RISK_EVAL.md` for prioritization metrics, 
and `docs/LIVE_EVAL.md` for live-cluster methodology. -| Magpie (2024) | ~84% dry-run acceptance | 9.5k manifests | RBAC and PSP static analysis | None | diff --git a/data/batch_runs/grok_5k/metrics_grok5k.json b/data/batch_runs/grok_5k/metrics_grok5k.json index 9bc2e3a9..5c24aa62 100644 --- a/data/batch_runs/grok_5k/metrics_grok5k.json +++ b/data/batch_runs/grok_5k/metrics_grok5k.json @@ -14,4 +14,4 @@ "completion_tokens": 689779.0, "total_tokens": 11399926.0 } -} \ No newline at end of file +} diff --git a/data/eval/unified_eval_summary.json b/data/eval/unified_eval_summary.json index d959cb95..378b36f0 100644 --- a/data/eval/unified_eval_summary.json +++ b/data/eval/unified_eval_summary.json @@ -52,13 +52,13 @@ } }, { - "dataset": "Manifest 1.313k", + "dataset": "Full Corpus (Rules)", "mode": "rules", "seed": 1337, "note": "Full manifest slice in deterministic rules mode.", - "total": 13656, + "total": 15718, "accepted": 13589, - "acceptance_rate": 0.9951, + "acceptance_rate": 0.8646, "median_patch_ops": 8, "proposer_latency_ms": { "count": 0, diff --git a/data/metrics_rules_full.json b/data/metrics_rules_full.json index bc5261f0..f5e01563 100644 --- a/data/metrics_rules_full.json +++ b/data/metrics_rules_full.json @@ -1,9 +1,9 @@ { - "detections": 13656, + "detections": 15718, "patches": 13656, "verified": 13656, "accepted": 13589, - "auto_fix_rate": 0.9951, + "auto_fix_rate": 0.8646, "median_patch_ops": 8, "failed_policy": 63, "failed_schema": 63, @@ -14,4 +14,4 @@ "completion_tokens": 0.0, "total_tokens": 0.0 } -} \ No newline at end of file +} diff --git a/figures/admission_vs_posthoc.png b/figures/admission_vs_posthoc.png index ea92ac66..30bcc07c 100644 Binary files a/figures/admission_vs_posthoc.png and b/figures/admission_vs_posthoc.png differ diff --git a/figures/fairness_waits.png b/figures/fairness_waits.png index 195d3e4b..b473d245 100644 Binary files a/figures/fairness_waits.png and b/figures/fairness_waits.png differ diff --git 
a/figures/mode_comparison.png b/figures/mode_comparison.png index 0b1542b4..86188a7c 100644 Binary files a/figures/mode_comparison.png and b/figures/mode_comparison.png differ diff --git a/figures/operator_ab.png b/figures/operator_ab.png index d0fb12b1..9dccc1b5 100644 Binary files a/figures/operator_ab.png and b/figures/operator_ab.png differ diff --git a/final-to-do.md b/final-to-do.md deleted file mode 100644 index 962f1917..00000000 --- a/final-to-do.md +++ /dev/null @@ -1,273 +0,0 @@ -# Final To-Do - -## Paper Quality Evaluation (Nov 11, 2025) -**Overall Score: 8.1/10** - Strong Accept with Minor Revisions - -### HIGH PRIORITY (Critical for publication rigor) - -1. **[SUBSTANTIVE] Add statistical significance tests** ⚗️ - **Effort**: 2-3 hours | **Impact**: Critical for top-tier venues - - Add p-values to Table 4 (eval_summary) for acceptance rate comparisons - - Use proportion z-tests for acceptance rates (88.78% vs 99.51%) - - Use Mann-Whitney U tests for latency comparisons (non-parametric) - - Add note under tables: "p < 0.001 for all pairwise comparisons" - - **Tool**: Python `scipy.stats.proportions_ztest()` and `mannwhitneyu()` - - **Gap**: This is the only substantive scientific gap - everything else is presentation - _Status: Completed via `scripts/eval_significance.py`, `data/eval/significance_tests.json`, and the new Table~\ref{tab:eval_summary} note._ - -2. **[POLISH] Break up mega-sentences in evaluation section** ✂️ - **Effort**: 30 minutes | **Impact**: High readability improvement - - Lines 516-520: One sentence spanning 6+ lines - split into 4 sentences - - Lines 833-834: Discussion mega-paragraph - split into 6 sentences - - Target: No sentence longer than 3 lines in compiled PDF - _Status: Evaluation section rewritten at `paper/access.tex:521-542`. Discussion section broken into 4 logical paragraphs at `paper/access.tex:853-859`. **✅ COMPLETE (Nov 11, 2025)**_ - -3. 
**[POLISH] Remove file paths from abstract** 🎨 - **Effort**: 5 minutes | **Impact**: Convention compliance - - Line 92: Remove `\texttt{data/live\_cluster/results\_1k.json}` from abstract - - Keep file paths in body text where they provide reproducibility value - - Abstract should be self-contained without implementation details - _Status: Abstract now cites only the success rate (see `paper/access.tex:88-90`)._ - -4. **[SUBSTANTIVE] Add explicit threat model subsection** 🛡️ - **Effort**: 1 hour | **Impact**: Medium (critical for security venues) - - Lines 633-635 mention "malicious manifests" but don't define adversary - - Add "Threat Model" subsection in Section 4 (before or after "Threats and Mitigations") - - Define: trusted components (detector, verifier), untrusted inputs (manifests, LLM outputs) - - State which attacks are in/out of scope (supply chain, prompt injection, fixture poisoning) - _Status: Section~4.1 now documents the threat model (`paper/access.tex:608-620`)._ - -### MEDIUM PRIORITY (Enhances rigor and clarity) - -5. **[SUBSTANTIVE] Define fairness metrics explicitly** 📐 - **Effort**: 1 hour | **Impact**: Medium - - Line 526: "Gini 0.351, starvation rate 0" appears without definition - - Define starvation threshold (e.g., "items waiting >24 hours") - - Compare Gini to FIFO baseline Gini (is 0.351 good or bad?) - - Add fairness plot showing wait time distribution by risk quartile - _Status: Definitions reference \url{data/scheduler/fairness_metrics.json} and Figure~\ref{fig:fairness} in `paper/access.tex:812-826`._ - -6. 
**[POLISH] Add figure interpretations inline** 📊 - **Effort**: 30 minutes | **Impact**: Medium - - After Fig 2 reference: Add 1-2 sentences interpreting trends - - Example: "Figure 2 shows rules-only maintains 99.5% acceptance while LLM drops to 88.8%" - - Apply to all figures (admission_vs_posthoc, mode_comparison, operator_ab) - _Status: Interpretations accompany Figures~\ref{fig:mode_comparison}, \ref{fig:admission_vs_posthoc}, \ref{fig:fairness}, and \ref{fig:operator_ab}. _ - -7. **[POLISH] Consolidate notation** 📝 - **Effort**: 1 hour | **Impact**: Medium - - Risk score formula appears at lines 698, 702, and Appendix 843+ - - Create single "Notation" box in Section 3 - - Reference back consistently: "as defined in Eq. (1)" - - Consider notation table if symbols exceed 10 - _Status: Shared notation now lives in `paper/access.tex:201-207` and is cited by the scheduler equation._ - -8. **[REFACTOR] Restructure approach vs implementation sections** 🏗️ - **Effort**: 2 hours | **Impact**: Medium (clarity) - - Current: "Approach Summary" (line 198) comes before "Implementation" (312) but leaks details - - Proposed structure: - - Section 2: System Design (architecture, guardrails, conceptual flow) - - Section 3: Implementation (code, artifacts, metrics definitions) - - Section 4: Evaluation (results, baselines, ablations) - - Section 5: Discussion - _Status: Sections now read "System Design," "Implementation and Metrics," and "Evaluation" (`paper/access.tex:198-324`)._ - -### LOW PRIORITY (Polish only - nice to have) - -9. **[POLISH] Add acronym table or expand more frequently** 🔤 - **Effort**: 30 minutes | **Impact**: Low - - Heavy acronym use: PSS, CIS, KEV, EPSS, RAG, CVE, CVSS, MAP, CEL, MTTR, RBAC, CRD, CTI - - Option A: Add acronym table in front matter - - Option B: Re-expand acronyms if first use was >5 pages ago - _Status: Appendix~\ref{app:acronyms} catalogs the acronyms._ - -10. 
**[POLISH] Consistency pass on code font** 💻 - **Effort**: 15 minutes | **Impact**: Low - - Sometimes `\texttt{kubectl}` (correct), sometimes plain "kubectl" (line 344) - - Run grep for tool names and wrap consistently in `\texttt{}` - - Apply to: kubectl, kube-linter, helm, docker, python - _Status: Tool names in the environment table and ArtifactHub section now use `\texttt{}`._ - -11. **[POLISH] Consider footnotes for long commands** 📝 - **Effort**: 30 minutes | **Impact**: Low - - Line 874: `python scripts/\allowbreak collect_artifacthub.py\ --limit\ 5000` still overflows - - Use footnotes for commands longer than ~60 characters - - Keeps main text cleaner - _Status: ArtifactHub instructions now cite the command in a footnote (`paper/access.tex:874-875`)._ - ---- - -## Outstanding Repository Tasks - -- _None._ Item 26 in `notes/to-do list` is now closed (Nov 14) with a documented rationale for keeping the 1k AKS replay as the terminal live-cluster sweep (policy/resource coverage + \$4–5k cost avoidance); see `notes/to-do list:161-164` for details. - -## Submission-Readiness Follow-ups - -- **Confirm future-work placeholders** - Validate that references to future experiments (expanded detector validation, new scheduler ablations, large-corpus latency telemetry) are either executed or clearly labeled as future work before submission. (Source: user "Progress & Decisions" summary.) - -- **Recompile after final edits** - Continue running `pdflatex -interaction=nonstopmode -halt-on-error paper/access.tex` whenever new changes land so `paper/access.pdf` remains in sync with `paper/access.tex`. (Source: user "Next Steps".) - -- **Refresh Grok/LLM latency metrics on request** *(OVERLAPS with High Priority Item 1)* - Table \ref{tab:eval_summary} still shows "—" for Grok timing; be prepared to regenerate latency telemetry (`data/batch_runs/grok200_latency_summary.csv`, `data/batch_runs/grok_5k/metrics_grok5k.json`) if updated numbers are required. 
(Source: user "Next Steps".) - ---- - -## OVERLAP ANALYSIS - -### Direct Overlaps: -- ✅ **"Refresh Grok/LLM latency metrics"** overlaps with **High Priority Item 1** (statistical tests need this data) -- ✅ **"Recompile after final edits"** already covered - applies to all paper changes - -### Complementary Items: -- **Repository task "Live-cluster sweep expansion"** + **Paper item "Statistical tests"** = Both strengthen evaluation rigor -- **"Confirm future-work placeholders"** + **Paper item "Threat model"** = Both clarify scope/assumptions - ---- - -## WHAT ELSE NEEDS TO BE DONE (Gap Analysis) - -### Missing from Original To-Do: -1. ❌ **No mention of statistical rigor** - This is the BIGGEST gap - **Action**: High Priority Item 1 (stat tests) addresses this - -2. ❌ **No readability/writing quality items** - Paper is very dense - **Action**: High Priority Items 2-3 (break sentences, clean abstract) address this - -3. ❌ **No security model clarity** - Paper mentions threats but doesn't formalize - **Action**: High Priority Item 4 (threat model) addresses this - -### Still Missing (New Items to Consider): - -1. **Experimental reproducibility verification** - - Has anyone external run `make detect && make propose && make verify`? - - Consider VM/container test before submission - - Add to "Submission-Readiness": "External reproducibility smoke test" - -2. **Acknowledgments section** - - Paper currently has placeholder funding note (line 86) - - Need to acknowledge: dataset sources, infrastructure providers, reviewers - - Add to "Submission-Readiness": "Finalize acknowledgments" - -3. **Author biographies completeness** - - Lines 952-964: Biographies present but photos may need verification - - Ensure `brian_mendonca_photo.png` and `vijay_madisetti_photo.png` exist and are high-res - - Add to "Submission-Readiness": "Verify author photos" - -4. 
**Bibliography completeness** - - Lines 886-947: 17 references (seems low for systems paper) - - Consider adding: Kubernetes security surveys, policy enforcement papers, bandit algorithm foundations - - Add to "Medium Priority": "Expand related work citations (target 25-30 refs)" - -5. **LaTeX compilation warnings check** - - Run with `-file-line-error` flag to catch overfull hboxes, undefined refs - - Add to "Submission-Readiness": "Fix all LaTeX warnings" - ---- - -## RECOMMENDED EXECUTION ORDER - -**Week 1** (Critical path): -1. High Priority Item 1 (stat tests) - 3 hours -2. High Priority Item 2 (break sentences) - 30 min -3. High Priority Item 3 (clean abstract) - 5 min -4. Recompile and verify PDF - -**Week 2** (If time permits): -5. High Priority Item 4 (threat model) - 1 hour -6. Medium Priority Items 5-6 (fairness metrics, figure interp) - 1.5 hours -7. Address new item: Bibliography expansion - 2 hours -8. Final recompile - -**Before Submission**: -9. External reproducibility test (new item) -10. Finalize acknowledgments (new item) -11. Verify author photos (new item) -12. Fix all LaTeX warnings (new item) - ---- - -## PUBLICATION READINESS ESTIMATE - -**Current State**: 75th percentile (solid work) -**After High Priority items**: 85th percentile (strong accept) -**After High + Medium items**: 90th percentile (reference paper) -**After all items**: 95th percentile (exemplary) - -**Estimated total effort**: 8-12 hours for High Priority, 16-20 hours for complete - ---- - -## SUMMARY FOR MR. BRIAN - -### The Brutal Truth: -- **60%** of revisions are pure polish (readability, formatting) -- **30%** are methodological substance (stat tests, fairness definitions) -- **10%** are architectural clarity (threat model) - -### The 80/20 Fix: -Do these 2 items for 80% of the benefit: -1. **Add statistical significance tests** (2-3 hours) - Only substantive gap -2. 
**Break up mega-sentences** (30 min) - Biggest readability win - -Everything else is "nice to have" polish. - -### Original Status (Before Revisions): -Your paper was **already at 75th percentile** for technical quality. These revisions pushed it to 90-95th percentile. The core science was solid - we just made reviewers' lives easier. - ---- - -## ✅ COMPLETION STATUS (Nov 11, 2025) - -### **ALL 11 PRIORITY ITEMS: 100% COMPLETE** - -**High Priority (4/4):** ✅ COMPLETE -- Item 1: Statistical significance tests ✅ -- Item 2: Break up mega-sentences ✅ (Final fix applied) -- Item 3: Remove file paths from abstract ✅ -- Item 4: Explicit threat model ✅ - -**Medium Priority (4/4):** ✅ COMPLETE -- Item 5: Define fairness metrics ✅ -- Item 6: Add figure interpretations ✅ -- Item 7: Consolidate notation ✅ -- Item 8: Restructure sections ✅ - -**Low Priority (3/3):** ✅ COMPLETE -- Item 9: Acronym table ✅ -- Item 10: Code font consistency ✅ -- Item 11: Footnotes for long commands ✅ - ---- - -## 📊 FINAL QUALITY ASSESSMENT - -**Publication Readiness: 95th Percentile** ⭐⭐⭐⭐⭐ - -| Dimension | Score | Notes | -|-----------|-------|-------| -| Technical Contribution | 8.5/10 | Novel verifier triad + risk-aware scheduler | -| Reproducibility | 9.5/10 | Exemplary (best-in-class artifact management) | -| Writing Quality | 8.5/10 | ✅ **IMPROVED** from 7.0 - All mega-sentences fixed | -| Evaluation Rigor | 9.0/10 | ✅ **IMPROVED** from 8.0 - Statistical tests added | -| Comparison Fairness | 8.5/10 | Honest about limitations | -| Practical Impact | 8.0/10 | Real problem, deployable solution | -| Figures/Tables | 8.5/10 | ✅ **IMPROVED** from 7.5 - All interpretations added | - -**COMPOSITE SCORE: 8.6/10** → **Strong Accept** (improved from 8.1/10) - ---- - -## 🎯 REMAINING PRE-SUBMISSION TASKS - -All 11 priority items complete. Only standard submission checklist remains: - -1. **Final recompile** - Run `pdflatex` to generate clean PDF -2. **LaTeX warnings check** - Fix any overfull hboxes -3. 
**Bibliography review** - Consider expanding from 17 to 20-25 refs (optional) -4. **Acknowledgments** - Finalize funding/contributor acknowledgments -5. **Author photos** - Verify high-res photos exist (✅ already confirmed present) -6. **External reproducibility test** - Have someone run `make detect && make propose && make verify` - -**Estimated time to submission-ready: 2-3 hours** (mostly admin tasks) diff --git a/paper/access.tex b/paper/access.tex index 2eecb675..6994816f 100644 --- a/paper/access.tex +++ b/paper/access.tex @@ -91,7 +91,7 @@ \IEEEtitleabstractindextext{% \begin{abstract} -Containerized apps depend on short YAML files that wire images, permissions, and storage together; small typos or copy/paste errors can create outages or security gaps. Most tools only flag the problems, leaving people to guess at safe fixes. We built \texttt{k8s-auto-fix} to close the loop: detect an issue, suggest a small patch, verify it safely, and line it up for application. On a 1,000-manifest replay against a real cluster we fixed every item without rollbacks (1,000/1,000). On a 15,718-detection offline run, deterministic rules plus safety checks accepted 13,338 of 13,373 patched items (99.74\%; auto-fix rate 0.8486; median patch length 9). An optional LLM mode reaches 88.78\% acceptance on a 5,000-manifest corpus. A simple risk-aware scheduler also cuts the worst-case wait for high-risk items by 7.9$\times$. We release all data and scripts so others can reproduce these results. +Containerized apps depend on short YAML files that wire images, permissions, and storage together; small typos or copy/paste errors can create outages or security gaps. Most tools only flag the problems, leaving people to guess at safe fixes. We built \texttt{k8s-auto-fix} to close the loop: detect an issue, suggest a small patch, verify it safely, and line it up for application. On a 1,000-manifest replay against a real cluster we fixed every item without rollbacks (1,000/1,000). 
On a 15,718-detection offline run, deterministic rules plus safety checks accepted 13,589 of 13,656 patched items (99.51\%; auto-fix rate 0.8646; median patch length 8). An optional LLM mode reaches 88.52\% acceptance on a 5,000-manifest corpus. A simple risk-aware scheduler also cuts the worst-case wait for high-risk items by 7.9$\times$. We release all data and scripts so others can reproduce these results. \end{abstract} \begin{IEEEkeywords} @@ -138,7 +138,7 @@ \section{Importance of the Problem} \midrule \textbf{Risk Prioritization} & Bandit ($R p / \mathbb{E}[t]$ + aging + KEV boost) & Not implemented & FIFO admission queue & Priority queues / toil budgets \\ \midrule -\textbf{Evaluation Corpus} & 15{,}718 detections (rules+guardrails: 13{,}338/13{,}373 patched = 99.74\%; auto-fix 0.8486; median ops 9); 1{,}000 live-cluster manifests (100.0\% success); 5{,}000 Grok manifests (88.78\%); 1{,}264 supported manifests (100.00\% rules) & 200 curated manifests (85--92\% accuracy) & Thousands of user manifests (80--95\% mutation acceptance) & Millions of production workloads (no public acceptance \%) \\ +\textbf{Evaluation Corpus} & 15{,}718 detections (rules+guardrails: 13{,}589/13{,}656 patched = 99.51\%; auto-fix 0.8646; median ops 8); 1{,}000 live-cluster manifests (100.0\% success); 5{,}000 Grok manifests (88.52\%); 1{,}264 supported manifests (100.00\% rules) & 200 curated manifests (85--92\% accuracy) & Thousands of user manifests (80--95\% mutation acceptance) & Millions of production workloads (no public acceptance \%) \\ \midrule \textbf{Telemetry} & Policy-level success probabilities, latency histograms, failure taxonomy & Token/cost estimates; no pipeline telemetry & Admission latency $<45$~ms, violation counts & MTTR, incident counts, operator feedback \\ \midrule @@ -325,7 +325,7 @@ \subsection{End-to-End Walkthrough on Real Manifests} \subsection{Research Questions and Findings} \begin{enumerate} - \item[\textbf{RQ1}] \textbf{Robustness:} The closed 
loop delivers 88.78\% acceptance on the Grok-5k sweep, 100.00\% on the supported 1,264-manifest corpus in rules mode, and 13{,}338/13{,}373 (99.74\%) accepted on the full 15,718-detection run under deterministic rules + guardrails (auto-fix rate 0.8486 over detections; median ops 9), with no hostPath-related safety failures remaining. + \item[\textbf{RQ1}] \textbf{Robustness:} The closed loop delivers 88.52\% acceptance on the Grok-5k sweep, 100.00\% on the supported 1,264-manifest corpus in rules mode, and 13{,}589/13{,}656 (99.51\%) accepted on the full 15,718-detection run under deterministic rules + guardrails (auto-fix rate 0.8646 over detections; median ops 8), with no hostPath-related safety failures remaining. \item[\textbf{RQ2}] \textbf{Scheduling Effectiveness:} The bandit ($R p / \mathbb{E}[t]$ + aging + KEV boost) improves risk reduction per hour and reduces top-risk P95 wait from 102.3~hours (FIFO) to 13.0~hours ($7.9\times$). \item[\textbf{RQ3}] \textbf{Fairness:} Aging prevents starvation, keeping mean rank for the top-50 high-risk items at 25.5 while still progressing lower-risk items. \item[\textbf{RQ4}] \textbf{Patch Quality:} Generated JSON Patches remain minimal (median 5 ops; P95 6) and idempotent (checked by \texttt{tests/test\_patch\_minimality.py}).
@@ -669,7 +669,7 @@ \subsection{Evaluation Results} Supported (rules, 1{,}264) & 1337 & 1264/1264 (100.00\%) & 29.0 & 242.0 & 517.8 \\ -Full corpus (rules+guardrails, 15{,}718 detections) & 1337 & 13338/13373 (99.74\%; auto-fix 0.8486 over detections) & -- & -- & -- \\ +Full corpus (rules+guardrails, 15{,}718 detections) & 1337 & 13589/13656 (99.51\%; auto-fix 0.8646 over detections) & -- & -- & -- \\ Manifest slice (Grok/xAI, 1{,}313) & 1337 & 1313/1313 (100.00\%) & 5095.5 & 138.4 & 904.6 \\ -Grok-5k (Grok/xAI) & 1337 & 4439/5000 (88.78\%) & 5095.5 & 138.4 & 904.6 \\ +Grok-5k (Grok/xAI) & 1337 & 4426/5000 (88.52\%) & 5095.5 & 138.4 & 904.6 \\ \bottomrule \end{tabularx}} \endgroup @@ -898,7 +898,7 @@ \section{Limitations and Mitigations} \end{itemize} \section{Discussion and Future Work} -The current pipeline achieves 100.0\% live-cluster success (1,000/1,000 stratified manifests) with perfect dry-run/live-apply alignment and surpasses academic baselines (Table~\ref{tab:eval_summary}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/live_cluster/results_1k.json}{data/live\_cluster/results\_1k.json}). Across offline corpora, the system delivers 93.54\% acceptance on the 5k supported corpus, 100.00\% on the 1,264-manifest supported slice, 100.00\% on the 1,313-manifest Grok/xAI run, and 88.78\% on Grok-5k overall, while deterministic rules + guardrails now accept 13{,}338 / 13{,}373 patched items (99.74\%; auto-fix rate 0.8486 over 15,718 detections) with median patch ops 9 (Table~\ref{tab:eval_summary}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/metrics_latest.json}{data/metrics\_latest.json}). The risk-aware scheduler trims top-risk P95 wait times from 102.3\,h (FIFO) to 13.0\,h (\href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/scheduler/metrics_sweep_live.json}{data/scheduler/metrics\_sweep\_live.json}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/outputs/scheduler/metrics_schedule_sweep.json}{data/outputs/scheduler/metrics\_schedule\_sweep.json}).
+The current pipeline achieves 100.0\% live-cluster success (1,000/1,000 stratified manifests) with perfect dry-run/live-apply alignment and surpasses academic baselines (Table~\ref{tab:eval_summary}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/live_cluster/results_1k.json}{data/live\_cluster/results\_1k.json}). Across offline corpora, the system delivers 93.54\% acceptance on the 5k supported corpus, 100.00\% on the 1,264-manifest supported slice, 100.00\% on the 1,313-manifest Grok/xAI run, and 88.52\% on Grok-5k overall, while deterministic rules + guardrails now accept 13{,}589 / 13{,}656 patched items (99.51\%; auto-fix rate 0.8646 over 15,718 detections) with median patch ops 8 (Table~\ref{tab:eval_summary}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/metrics_latest.json}{data/metrics\_latest.json}). The risk-aware scheduler trims top-risk P95 wait times from 102.3\,h (FIFO) to 13.0\,h (\href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/scheduler/metrics_sweep_live.json}{data/scheduler/metrics\_sweep\_live.json}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/outputs/scheduler/metrics_schedule_sweep.json}{data/outputs/scheduler/metrics\_schedule\_sweep.json}). All metrics in this paper are regenerated from the public artifact bundle (\texttt{make reproducible-report}, \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/ARTIFACTS.md}{ARTIFACTS.md}), and the scheduler comparisons we report stem from deterministic queue replays rather than live analyst rotations. These gains are anchored in deterministic guardrails, schema validation, and server-side dry-run enforcement, with matching Reasoning API runs available to practitioners who can supply xAI credentials and budget roughly \$1.22 per 5k sweep under the published pricing (\href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/data/grok5k_telemetry.json}{data/grok5k\_telemetry.json}, \cite{xai_pricing}). 
To prevent configuration drift, every accepted patch is surfaced as a pull request through our GitOps helper (\href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/scripts/gitops_writeback.py}{scripts/gitops\_writeback.py}), which records verifier evidence, captures the JSON Patch diff, and requires human approval before merge, mirroring the workflow detailed in \href{https://github.com/bmendonca3/k8s-auto-fix/blob/main/docs/GITOPS.md}{docs/GITOPS.md}. diff --git a/src/eval/metrics.py b/src/eval/metrics.py index 0ef5628a..060ee93c 100644 --- a/src/eval/metrics.py +++ b/src/eval/metrics.py @@ -1,5 +1,6 @@ from __future__ import annotations +import gzip import json import statistics from pathlib import Path @@ -93,10 +94,17 @@ def run( def _load_array(path: Path) -> List[Any]: try: - with path.open("r", encoding="utf-8") as f: - data = json.load(f) + if path.suffix == ".gz": + with gzip.open(path, "rt", encoding="utf-8") as f: + data = json.load(f) + else: + with path.open("r", encoding="utf-8") as f: + data = json.load(f) except FileNotFoundError: return [] + except Exception as e: + typer.echo(f"Error loading {path}: {e}", err=True) + return [] if not isinstance(data, list): return [] return data @@ -104,4 +112,3 @@ def _load_array(path: Path) -> List[Any]: if __name__ == "__main__": # pragma: no cover typer.run(run) -
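
The `_load_array` change in `src/eval/metrics.py` above makes tooling transparent to the `.json.gz` artifacts, so consumers no longer have to `gunzip` before running the metrics scripts. A minimal standalone sketch of the same pattern (the helper name `load_array` and the simplified exception handling are illustrative, not the module's exact code, which routes errors through `typer.echo`):

```python
import gzip
import json
from pathlib import Path
from typing import Any, List


def load_array(path: Path) -> List[Any]:
    """Load a JSON array from a .json or .json.gz file; return [] on failure."""
    try:
        if path.suffix == ".gz":
            # Compressed artifacts, e.g. data/patches_rules_full.json.gz
            with gzip.open(path, "rt", encoding="utf-8") as f:
                data = json.load(f)
        else:
            with path.open("r", encoding="utf-8") as f:
                data = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Missing file or malformed JSON both degrade to an empty array.
        return []
    return data if isinstance(data, list) else []


if __name__ == "__main__":
    # Round-trip a small JSON Patch array through a gzipped file.
    tmp = Path("example.json.gz")
    with gzip.open(tmp, "wt", encoding="utf-8") as f:
        json.dump([{"op": "add", "path": "/spec/hostNetwork", "value": False}], f)
    print(load_array(tmp))
    tmp.unlink()
```

Because `gzip.open(..., "rt")` yields a text stream, the same `json.load` call serves both branches; only the file opener differs.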