Open
Changes from 11 commits
Commits
30 commits
edd2004
Create AKS folder and SKILL.md
julia-yin Feb 25, 2026
a4eab8e
Add azure-kubernetes to skill.json
julia-yin Feb 25, 2026
2cf0363
Update skills.json
julia-yin Feb 25, 2026
da83ce2
Merge branch 'main' into main
julia-yin Feb 25, 2026
59186b0
Fix issue of postgres skill missing from skills.json
julia-yin Feb 25, 2026
ac9301a
Fix skills.json
julia-yin Feb 25, 2026
f24eb8e
Add AKS to architecture.md and testing for AKS skill
julia-yin Feb 27, 2026
16c29c8
Update plugin/skills/azure-kubernetes/SKILL.md
julia-yin Feb 28, 2026
9dc9578
Update SKILL.md
julia-yin Feb 28, 2026
278d7a0
Merge branch 'main' of https://github.com/julia-yin/GitHub-Copilot-fo…
julia-yin Feb 28, 2026
6e2ab85
Merge branch 'main' into main
julia-yin Feb 28, 2026
3f6e3a6
Remove trailing empty lines
julia-yin Feb 28, 2026
3428b30
Merge branch 'main' of https://github.com/julia-yin/GitHub-Copilot-fo…
julia-yin Feb 28, 2026
35c636c
Add AKS to integration test schedule
julia-yin Feb 28, 2026
4995afd
Fix pr.yaml creating leading space
julia-yin Feb 28, 2026
2f940d5
Update SKILL.md
julia-yin Feb 28, 2026
1a92efd
Update triggers.test.ts.snap
julia-yin Feb 28, 2026
d58b49b
Add in missing best practices (ephemeral disk, auto upgrades, reliabi…
julia-yin Mar 2, 2026
fc92679
Add security best practices
julia-yin Mar 2, 2026
f63b19d
Merge branch 'main' into main
julia-yin Mar 2, 2026
b47bed8
Streamline and reduce token count
julia-yin Mar 2, 2026
a2acfc2
Add azure-kubernetes to skills.json
julia-yin Mar 2, 2026
1b8e483
Fix naming issues
julia-yin Mar 2, 2026
2b11b8c
Update trigger and unit tests
julia-yin Mar 2, 2026
4a0a598
Bump azure-prepare version to 1.0.1
julia-yin Mar 3, 2026
7dc8f22
Fix metadata.version
julia-yin Mar 3, 2026
0862041
Add metadata to azure-kubernetes skill
julia-yin Mar 3, 2026
77758cd
Merge branch 'main' into main
julia-yin Mar 3, 2026
a74df39
Apply suggestion from @Copilot
julia-yin Mar 3, 2026
9ac4bda
Apply suggestion from @Copilot
julia-yin Mar 3, 2026
367 changes: 367 additions & 0 deletions plugin/skills/azure-kubernetes/SKILL.md
@@ -0,0 +1,367 @@
---
name: azure-kubernetes
description: >-
Plan and create production-ready Azure Kubernetes Service (AKS) clusters. Covers Day-0 decisions (networking, API server access, pod IP model), Day-1 configuration (identity, secrets, governance, observability), cluster SKUs (Automatic vs Standard), workload identity, Key Vault CSI, Azure Policy, deployment safeguards, monitoring with Prometheus/Grafana, upgrade strategies, and cost analysis.
USE FOR: create AKS cluster, AKS cluster planning, AKS networking design, security design, upgrade settings, autoscaling, AKS monitoring, AKS cost analysis, AKS production best practices, AKS Automatic vs Standard, AKS add-ons
DO NOT USE FOR: debugging AKS issues (use azure-diagnostics), deploying applications to AKS (use azure-deploy), creating other Azure resources (use azure-prepare), setting up general monitoring (use azure-observability), general cost optimization strategies (use azure-cost-optimization)
---
Comment on lines +1 to +7
Copilot AI Mar 3, 2026

The skill description/WHEN list includes many generic words (e.g., “plan”, “create”, “best”, “practices”, “deploy”). In this repo’s trigger tests, TriggerMatcher adds every description word >3 chars as a keyword and triggers on >=2 matches, which increases the chance of false positives (e.g., unrelated prompts containing “create” + “deploy” + “container”). Consider tightening the description/WHEN phrases to be more AKS-specific so keyword extraction stays discriminative.
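
The matching rule described in this comment can be illustrated with a small sketch. This is not the repository's TriggerMatcher; the function names `keywords`, `match_count`, and `would_trigger` are hypothetical, and the only behavior assumed is what the comment states: words longer than 3 characters become keywords, and 2 or more matches trigger the skill.

```bash
# Hypothetical re-creation of the rule described above (not the repo's code).

# Print the unique lowercase words longer than 3 characters, one per line.
keywords() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '\n' \
    | awk 'length($0) > 3' | sort -u
}

# Count how many keywords of the description ($1) also appear in the prompt ($2).
match_count() {
  keywords "$2" | grep -cFx -e "$(keywords "$1")" || true
}

# Succeed when at least two keywords match.
would_trigger() {
  [ "$(match_count "$1" "$2")" -ge 2 ]
}

desc="Plan and create production-ready Azure Kubernetes Service clusters"
match_count "$desc" "plan to create a web app"   # prints: 2
```

The prompt has nothing to do with AKS, yet "plan" and "create" are enough to reach the threshold. Note also that "AKS" itself is only 3 characters long, so under this rule it never becomes a keyword.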


# Azure Kubernetes Service

> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE**
>
> This document is the **official source** for setting up best-practice Azure Kubernetes Service clusters. Follow these instructions to create and configure AKS clusters that align with the user's requirements.

## Triggers
Activate this skill when the user wants to:
- Create a new AKS cluster
- Plan AKS cluster configuration for production workloads
- Design AKS networking (API server access, pod IP model, egress)
Comment on lines +15 to +27
Copilot AI Mar 2, 2026

This SKILL.md doesn’t follow the repository’s Skill File Authoring Guidelines required section structure (Quick Reference, When to Use This Skill, MCP Tools, Workflow/Steps, Error Handling). Please restructure the document to include those sections/tables so it’s consistent with other plugin skills and easier to scan.

- Set up AKS identity and secrets management
- Configure AKS governance (Azure Policy, Deployment Safeguards)
- Enable AKS observability (monitoring, Prometheus, Grafana)
- Define AKS upgrade and patching strategy
- Enable AKS cost visibility and analysis
- Understand AKS Automatic vs Standard SKU differences
- Get a Day-0 checklist for AKS cluster setup and configuration

## Rules

1. Start with the user's requirements for provisioning compute, networking, security, and other settings.
2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes.

Copilot AI Feb 25, 2026

Rule 2 refers to an "AKS MCP server", but repo MCP config only defines a generic azure MCP server (plugin/.mcp.json). This will lead agents to look for a non-existent server; please update the rule to reference the Azure MCP server and the relevant AKS-related MCP tools (or CLI) explicitly.

Suggested change
2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes.
2. Use the `azure` MCP server and its AKS-related MCP tools to invoke Azure APIs and perform AKS and kubectl operations whenever possible during cluster setup and ongoing operations; if required functionality is not available via MCP tools, fall back to Azure CLI and kubectl commands.

3. Determine if AKS Automatic or Standard SKU is more appropriate based on the user's need for control vs convenience. Default to AKS Automatic unless specific customizations are required.
4. Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access).

Comment on lines +15 to +41
Copilot AI Mar 3, 2026

Missing required MCP Tools section: the repo’s skill authoring guidelines require an explicit “MCP Tools” section with a table of commands/parameters (not just listing tool names in Quick Reference). See .github/instructions/skill-files.instructions.md (Required Sections #3).

---

## MCP Tools (Preferred)

When Azure MCP and AKS MCP are enabled, use these tools for AKS operations:

### Cluster Management
| Tool | Purpose |
|------|---------|
| `mcp_azure_mcp_aks` | Subscription-scoped AKS cluster queries and metadata |
| `mcp_aks_mcp_az_aks_operations` | Cluster operations: show, list, get-versions, nodepool management |

Copilot AI Mar 2, 2026

plugin/skills/**/SKILL.md files are limited to 500 tokens by the repo’s token-limit checks. This SKILL.md is currently 385 lines and will almost certainly exceed that limit, causing the PR token analysis job to fail. Please move most of the detailed guidance (decision framework, step-by-step execution, extended CLI examples, large tables) into plugin/skills/azure-kubernetes/references/*.md and keep SKILL.md as a concise router/summary with links to those references.

Suggested change
# Azure Kubernetes Service
> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE**
>
> This document is the **official source** for setting up best practice Azure Kubernetes Service clusters. Follow these instructions to create and configure AKS clusters that are aligned with the user's requirements.
## Triggers
Activate this skill when user wants to:
- Create a new AKS cluster
- Plan AKS cluster configuration for production workloads
- Design AKS networking (API server access, pod IP model, egress)
- Set up AKS identity and secrets management
- Configure AKS governance (Azure Policy, Deployment Safeguards)
- Enable AKS observability (monitoring, Prometheus, Grafana)
- Define AKS upgrade and patching strategy
- Enable AKS cost visibility and analysis
- Understand AKS Automatic vs Standard SKU differences
- Get a Day-0 checklist for AKS cluster setup and configuration
## Rules
1. Start with the user's requirements for provisioning compute, networking, security, and other settings.
2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes.
3. Determine if AKS Automatic or Standard SKU is more appropriate based on the user's need for control vs convenience. Default to AKS Automatic unless specific customizations are required.
4. Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access).
---
## MCP Tools (Preferred)
When Azure MCP and AKS MCP are enabled, use these tools for AKS operations:
### Cluster Management
| Tool | Purpose |
|------|---------|
| `mcp_azure_mcp_aks` | Subscription-scoped AKS cluster queries and metadata |
| `mcp_aks_mcp_az_aks_operations` | Cluster operations: show, list, get-versions, nodepool management |
# Azure Kubernetes Service (AKS)
This skill helps you **plan and create production-ready AKS clusters** using Azure and AKS MCP tools. It is a concise router to detailed guidance stored under `plugin/skills/azure-kubernetes/references/`.
---
## Quick Reference
| Aspect | Summary | Details / References |
|-------------------|-----------------------------------------------------------|----------------------|
| MCP tools | Prefer Azure and AKS MCP tools over raw CLI commands. | See [MCP tools](#mcp-tools) and `./references/mcp-aks.md`. |
| Cluster planning | Cover Day-0/Day-1 decisions, SKUs, and networking. | `./references/architecture-and-skus.md` |
| Security | Identity, workload identity, Key Vault CSI, governance. | `./references/security-and-governance.md` |
| Operations | Upgrades, autoscaling, observability, cost visibility. | `./references/operations-and-costs.md` |
| Troubleshooting | Common MCP / AKS failures and recovery steps. | `./references/troubleshooting.md` |
> ⚠️ **Warning:** Use this skill **only** for AKS cluster planning and configuration. For app deployment, diagnostics, or generic Azure setup, route to the appropriate skills listed in the description frontmatter.
---
## When to Use This Skill
Activate this skill when the user wants to:
- Create a new AKS cluster (dev, test, or production).
- Plan AKS cluster configuration for production or business-critical workloads.
- Design AKS networking (API server access, pod IP model, outbound/egress).
- Set up AKS identity and secrets management (managed identity, workload identity, Key Vault CSI).
- Configure AKS governance (Azure Policy, Deployment Safeguards, baseline guardrails).
- Enable AKS observability (monitoring, Prometheus, Grafana, logging).
- Define AKS upgrade, node image, and patching strategy.
- Analyze AKS costs or choose between **AKS Automatic** and **AKS Standard** SKUs.
Do **not** use this skill for:
- Debugging AKS runtime issues → use `azure-diagnostics`.
- Deploying or updating workloads on AKS → use `azure-deploy`.
- Creating non-AKS Azure resources or generic landing zones → use `azure-prepare`.
- Platform-wide monitoring or cost optimization → use `azure-observability` or `azure-cost-optimization`.
---
## MCP Tools
When Azure MCP and AKS MCP are enabled, prefer these tools for AKS operations:
| Tool | Scope / Purpose | Reference |
|-------------------------------------|-------------------------------------------------------------|------------------------------------|
| `mcp_azure_mcp_aks` | Subscription-scoped AKS cluster discovery and metadata. | `./references/mcp-aks.md` |
| `mcp_aks_mcp_az_aks_operations` | Cluster operations (show, list, versions, nodepools). | `./references/mcp-aks.md` |
| `mcp_aks_mcp_kubectl` | Cluster-level `kubectl` interactions when required. | `./references/mcp-kubectl.md` |
> 💡 **Tip:** Use AKS MCP tools for **read/write** operations first. Fall back to `az aks` CLI only when an operation is not exposed via MCP, and record this in the reasoning.
---
## Workflow/Steps
1. **Clarify requirements**
   - Capture environment (dev/test/prod), region, availability, scale, and compliance needs.
   - Identify network constraints (private clusters, IP strategy, egress model).
   - See `./references/requirements-and-questionnaire.md`.
2. **Plan cluster architecture**
   - Choose between **AKS Automatic** and **AKS Standard** SKUs.
   - Decide on network model, API server access, and identity strategy.
   - See `./references/architecture-and-skus.md`.
3. **Design security and governance**
   - Plan workload identity, Key Vault CSI, RBAC, and Azure Policy baselines.
   - See `./references/security-and-governance.md`.
4. **Define operations and observability**
   - Configure monitoring, logging, upgrade strategy, autoscaling, and cost visibility.
   - See `./references/operations-and-costs.md`.
5. **Summarize and validate**
   - Present a concise plan (decisions + rationale) before suggesting any CLI/MCP commands.
---
## Error Handling
Use this table to route common issues to the right remediation steps:
| Error / Symptom | Likely Cause / Next Step | Reference |
|------------------------------------------------------|--------------------------------------------------------------|------------------------------------|
| MCP tool call fails or times out | Check credentials, subscription, and AKS MCP configuration. | `./references/troubleshooting.md` |
| Cluster creation blocked by policy or quota | Review Azure Policy, quotas, and regional SKU availability. | `./references/troubleshooting.md` |
| Networking settings conflict (IP exhaustion, egress) | Revisit IP planning and egress design. | `./references/architecture-and-skus.md` |
| Identity / secrets not working as expected | Validate workload identity and Key Vault CSI configuration. | `./references/security-and-governance.md` |
For detailed step-by-step remediation flows and CLI examples, see `./references/troubleshooting.md`.

| `mcp_aks_mcp_get_aks_vmss_info` | VMSS configuration for node pools |

### kubectl Operations
| Tool | Purpose |
|------|---------|
| `mcp_aks_mcp_kubectl_resources` | Get/describe pods, deployments, services |
| `mcp_aks_mcp_kubectl_diagnostics` | Logs, events, top, exec, cp operations |
| `mcp_aks_mcp_kubectl_cluster` | Cluster info, API resources, explain |
| `mcp_aks_mcp_kubectl_config` | Config contexts, auth checks |

### Networking
| Tool | Purpose |
|------|---------|
| `mcp_aks_mcp_az_network_resources` | VNet, NSG, route tables, subnets, load balancers |

---

## CLI Fallback

```bash
# List AKS clusters
az aks list --output table

# Show cluster details
az aks show --name CLUSTER --resource-group RG

# Get available Kubernetes versions
az aks get-versions --location LOCATION --output table

# Create AKS Automatic cluster
az aks create --name CLUSTER --resource-group RG --sku automatic \
--network-plugin azure --network-plugin-mode overlay \
--enable-oidc-issuer --enable-workload-identity

# Create AKS Standard cluster
az aks create --name CLUSTER --resource-group RG \
--node-count 3 --zones 1 2 3 \
--network-plugin azure --network-plugin-mode overlay \
--enable-cluster-autoscaler --min-count 1 --max-count 10
Copilot AI Feb 28, 2026

The CLI example for creating an AKS Standard cluster omits --enable-oidc-issuer / --enable-workload-identity, even though the skill recommends Workload Identity as the preferred baseline elsewhere in the doc (and the Automatic example includes it). Either add the flags to the Standard example or explicitly call out when/why they should be enabled separately.

Suggested change
--enable-cluster-autoscaler --min-count 1 --max-count 10
--enable-cluster-autoscaler --min-count 1 --max-count 10 \
--enable-oidc-issuer --enable-workload-identity


# Get credentials
az aks get-credentials --name CLUSTER --resource-group RG

# List node pools
az aks nodepool list --cluster-name CLUSTER --resource-group RG --output table

# Enable monitoring
az aks enable-addons --name CLUSTER --resource-group RG \
--addons monitoring --workspace-resource-id WORKSPACE_ID
```

---

## Overview
This skill guides a user through planning and creating an Azure Kubernetes Service (AKS) cluster using public best practices for:
- cluster mode selection (Automatic vs Standard),
- networking (API server access, egress, pod IP model),
- identity (Microsoft Entra + Workload Identity),
- secrets management (Key Vault CSI),
- governance (Azure Policy + Deployment Safeguards),
- observability (Azure Monitor, Managed Prometheus, Managed Grafana),
- upgrades/patching (auto-upgrade channels, maintenance windows),
- cost visibility (AKS Cost Analysis).

References are public and included at the end.

---

## When to Use
Copilot AI Feb 25, 2026

Section header "## When to Use" is inconsistent with the repo’s common "## When to Use This Skill" heading (e.g., plugin/skills/appinsights-instrumentation/SKILL.md:18 and plugin/skills/azure-messaging/SKILL.md:25). Aligning the heading improves consistency and navigation across skills.

Suggested change
## When to Use
## When to Use This Skill

Use this skill when a user asks:
- “What do I need to decide before creating an AKS cluster?”
- “Create an AKS cluster plan/design for production”
- “AKS networking: overlay vs pod subnet vs node subnet”
- “How do I set up Workload Identity / Key Vault CSI / Azure Policy?”
- “How do I configure upgrades, patching, and observability on AKS?”

---

## Goals / Outcomes
1. Produce a **recommended AKS cluster configuration** based on user requirements (security, scale, connectivity, compliance).
2. Provide a **Day-0 checklist** (decisions that are hard to change later, like networking and API server exposure).
3. Provide a **Day-1 checklist** (baseline add-ons and settings for production readiness).
4. Optionally output a **command/IaC skeleton** (placeholders only unless user provides values).

---

## Required Inputs (Ask only what’s needed)
If the user is unsure, use safe defaults.

### 1. Environment & scale
- Environment: `dev/test` or `production`
- Region(s) + availability zones needed?
- Expected scale: node count / cluster count (single vs multi)

### 2. Networking requirements (Day-0 critical)
- API server access:
  - Public API server or Private cluster?
- Pod IP model:
  - Do pods need **direct routable IPs in the VNet**?
- Egress control:
  - Default outbound, NAT Gateway, or UDR + firewall/NVA?

### 3. Identity & security posture
- Microsoft Entra RBAC required?
- Need pod-to-Azure access with **Workload Identity**?
- Regulated environment needs (private cluster, policy enforcement, restricted egress)?

---

## Outputs (What the Skill Produces)
### Primary Output: “AKS Setup Plan”
1. Cluster type recommendation (Automatic vs Standard)
2. Networking plan (control plane access, egress choice, pod IP model)
3. Node pools + scaling plan
4. Security baseline (identity, secrets, policy)
5. Observability baseline (metrics/logs/dashboards/alerts)
6. Upgrade & patching plan
7. Cost controls baseline
8. Day-0 checklist + Day-1 checklist

### Optional Outputs
- CLI skeleton (placeholders)
- IaC outline (Bicep/Terraform module list)

---

## Decision Framework (Defaults when user is unsure)

### 1. Cluster Type

| Cluster SKU | Automatic | Standard |
|---------|-----------|----------|
| Best for | Production defaults, faster setup | Maximum control, custom requirements |
| Node provisioning | Automatic via Node Auto Provisioning (NAP) | Manual node pools |
| Autoscaling | Pre-configured | Configure manually |
| Networking | Azure CNI Overlay (fixed) | Choose model |
| Azure Policy | Enabled by default | Configure separately |
| Monitoring | Azure Monitor enabled | Configure separately |
| Deployment Safeguards | Enabled (Warning) | Configure separately |
| Node OS upgrades | Auto-configured | Configure channel |

**Recommendation**: Default to **AKS Automatic** unless you need:
- Custom networking (kubenet, Azure CNI with pod subnet)
- Windows node pools
- Specific node pool configurations not supported by NAP
- Full control over autoscaling behavior
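
The recommendation above reduces to a simple rule: any one of the listed needs moves the choice to Standard. A minimal sketch, assuming a hypothetical `recommend_sku` helper that takes one yes/no answer per bullet (the name and argument order are illustrative and not part of any Azure tooling):

```bash
# Hypothetical helper: one "yes"/"no" argument per requirement, in this order:
#   custom networking, Windows node pools, custom node pool config,
#   full control over autoscaling.
recommend_sku() (
  for need in "$@"; do
    if [ "$need" = "yes" ]; then
      echo "standard"
      return 0
    fi
  done
  echo "automatic"
)

recommend_sku no no no no    # prints: automatic
recommend_sku no yes no no   # prints: standard
```

Keeping the rule this blunt makes the rationale easy to document: the output names the SKU, and the first "yes" answer is the reason.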

### 2. Pod Networking Model (Key Day-0 decision)
- Prefer **Azure CNI Overlay** for scalability and conserving VNet IP space.
Docs: https://learn.microsoft.com/azure/aks/azure-cni-overlay

If pods must be directly addressable/routable in your VNet, use VNet-based Azure CNI options:
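- Azure CNI Pod Subnet or Azure CNI Node Subnet. Both assign pod IPs from the VNet, so size the subnets for the expected pod count up front (see the Azure CNI Overlay doc above and the related AKS networking docs)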

### 3. Dataplane / Network Policy
- Consider **Azure CNI powered by Cilium** for eBPF-based performance and policy/observability features.
Docs: https://learn.microsoft.com/azure/aks/azure-cni-powered-by-cilium

### 4. Workload Identity (Preferred for pod-to-Azure auth)
- Prefer **Microsoft Entra Workload ID** for workloads calling Azure services without secrets.
Docs: https://learn.microsoft.com/azure/aks/workload-identity-overview

### 5. Secrets
- Prefer Azure Key Vault via **Secrets Store CSI Driver** provider.
Docs: https://learn.microsoft.com/azure/aks/csi-secrets-store-driver

### 6. Governance
- Enable **Azure Policy** (prereq) and **Deployment Safeguards** for workload best-practice enforcement.
Docs: Deployment Safeguards: https://learn.microsoft.com/azure/aks/deployment-safeguards

### 7. Observability
- Enable AKS monitoring through Azure Monitor: container logs, managed Prometheus metrics, and managed Grafana dashboards.
Docs: https://learn.microsoft.com/azure/azure-monitor/containers/kubernetes-monitoring-enable
Prometheus overview: https://learn.microsoft.com/azure/azure-monitor/metrics/prometheus-metrics-overview

### 8. Upgrades & Patching
- Establish an upgrade strategy and ensure workloads are upgrade-safe (PDBs, probes, etc.).
Docs: AKS patch/upgrade guidance: https://learn.microsoft.com/azure/architecture/operator-guides/aks/aks-upgrade-practices

For node OS patching:
- Node OS auto-upgrade channels: https://learn.microsoft.com/azure/aks/auto-upgrade-node-os-image
For cluster version auto-upgrades:
- Cluster auto-upgrade channels: https://learn.microsoft.com/azure/aks/auto-upgrade-cluster

---

## Step-by-Step Execution (Agent Behavior)

### Step 1 — Classify scenario
Identify environment, compliance posture, region/AZ needs, scale, and workload types.

### Step 2 — Recommend cluster type
Recommend AKS Automatic or Standard with short rationale.
- AKS Automatic intro: https://learn.microsoft.com/azure/aks/intro-aks-automatic

### Step 3 — Lock networking (Day-0)
Ask:
- Public vs Private API server?
- Pod IP model: overlay vs VNet-routable requirement?
- Egress: LB vs NAT Gateway vs UDR+Firewall?

Reference: Azure CNI Overlay setup: https://learn.microsoft.com/azure/aks/azure-cni-overlay
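
Once the three questions are answered, they map onto `az aks create` flags. The sketch below shows one possible mapping. `network_flags` and its argument values are hypothetical, and `NODE_SUBNET_ID` and `POD_SUBNET_ID` are placeholders; check the flags against the current Azure CLI reference before using them.

```bash
# Hypothetical helper: turn the Day-0 answers into `az aks create` flags.
#   $1 API server access: public | private
#   $2 pod IP model:      overlay | pod-subnet
#   $3 egress:            loadbalancer | natgateway | udr
network_flags() (
  flags=""
  case "$1" in
    public)  ;;
    private) flags="--enable-private-cluster" ;;
    *) echo "unknown API server access: $1" >&2; return 1 ;;
  esac
  case "$2" in
    overlay)    flags="$flags --network-plugin azure --network-plugin-mode overlay" ;;
    pod-subnet) flags="$flags --network-plugin azure --vnet-subnet-id NODE_SUBNET_ID --pod-subnet-id POD_SUBNET_ID" ;;
    *) echo "unknown pod IP model: $2" >&2; return 1 ;;
  esac
  case "$3" in
    loadbalancer) flags="$flags --outbound-type loadBalancer" ;;
    natgateway)   flags="$flags --outbound-type managedNATGateway" ;;
    # userDefinedRouting expects a route table already attached to the node subnet.
    udr)          flags="$flags --outbound-type userDefinedRouting" ;;
    *) echo "unknown egress choice: $3" >&2; return 1 ;;
  esac
  echo "${flags# }"   # drop the leading space left by an empty first flag
)

network_flags private overlay udr
```

Rejecting unknown answers, instead of falling back to a default, fits the Day-0 framing: these settings are hard to change later, so a silent default is riskier than an error.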

### Step 4 — Node pools and compute
Recommend:
- system node pool + user node pools
- separate pools for GPU/batch/stateful if applicable
- capacity planning considerations (max pods per node affects IP planning, upgrades)
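
To make the link between max pods per node and IP planning concrete, here is a rough worked calculation. It assumes the commonly cited rule of thumb for VNet-routable pod IPs in the node subnet model: every node, including surge nodes added during upgrades, holds one IP for itself plus one per potential pod. The helper names are hypothetical, and the result is an estimate to verify against the AKS networking docs.

```bash
# Estimated subnet IPs when pods take addresses from the node subnet.
#   $1 node count, $2 max pods per node, $3 surge nodes during upgrade (default 1)
node_subnet_ips() (
  nodes=$1; max_pods=$2; surge=${3:-1}
  echo $(( ($nodes + $surge) * ($max_pods + 1) ))
)

# With an overlay, pods use a separate private CIDR, so the subnet only
# needs one IP per node.
overlay_subnet_ips() (
  nodes=$1; surge=${2:-1}
  echo $(( $nodes + $surge ))
)

node_subnet_ips 50 30     # prints: 1581
overlay_subnet_ips 50     # prints: 51
```

For the same 50-node cluster, the estimate differs by roughly a factor of 30, which is why the pod IP model is treated as a Day-0 decision.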

### Step 5 — Configure autoscaling
Recommend:
- HPA for pods
- Cluster Autoscaler / node scaling strategy
- If the user wants more automation, discuss Node Auto Provisioning (NAP) where it is available

### Step 6 — Identity and secrets
- Enable Workload Identity:
https://learn.microsoft.com/azure/aks/workload-identity-overview
- Use Key Vault CSI Driver:
https://learn.microsoft.com/azure/aks/csi-secrets-store-driver
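
As a sketch of what this step can turn into, the dry-run helper below prints a command skeleton instead of running anything. `federated_subject` and `print_identity_setup` are hypothetical names, and `CLUSTER`, `RG`, `IDENTITY`, `FEDERATED_CRED_NAME`, and `OIDC_ISSUER_URL` are placeholders. The one detail worth getting exactly right is the federated credential subject, which has to match the Kubernetes service account's namespace and name.

```bash
# Subject string for a federated credential: $1 = namespace, $2 = service account.
federated_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Dry run: print the commands with placeholders rather than executing them.
print_identity_setup() {
  echo "az aks update --name CLUSTER --resource-group RG --enable-oidc-issuer --enable-workload-identity"
  echo "az aks show --name CLUSTER --resource-group RG --query oidcIssuerProfile.issuerUrl --output tsv"
  echo "az identity federated-credential create --name FEDERATED_CRED_NAME --identity-name IDENTITY --resource-group RG --issuer OIDC_ISSUER_URL --subject $(federated_subject "$1" "$2") --audience api://AzureADTokenExchange"
  echo "az aks enable-addons --name CLUSTER --resource-group RG --addons azure-keyvault-secrets-provider"
}

print_identity_setup payments api-sa
```

The second printed command is where `OIDC_ISSUER_URL` comes from; the last one enables the Key Vault provider for the Secrets Store CSI Driver as an add-on.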

### Step 7 — Policy & safeguards
- Turn on Azure Policy and Deployment Safeguards (warn/enforce).
Docs: https://learn.microsoft.com/azure/aks/deployment-safeguards

### Step 8 — Observability baseline
- Enable monitoring using Azure Monitor guidance:
https://learn.microsoft.com/azure/azure-monitor/containers/kubernetes-monitoring-enable
- Managed Prometheus overview:
https://learn.microsoft.com/azure/azure-monitor/metrics/prometheus-metrics-overview

### Step 9 — Upgrades & patching
- Define upgrade approach:
https://learn.microsoft.com/azure/architecture/operator-guides/aks/aks-upgrade-practices
- Configure node OS upgrade channels:
https://learn.microsoft.com/azure/aks/auto-upgrade-node-os-image
- Configure cluster auto-upgrade channels:
https://learn.microsoft.com/azure/aks/auto-upgrade-cluster
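
Both channels can be set in a single `az aks update` call. The sketch below is a hypothetical dry-run helper, `upgrade_channel_cmd`, which checks the channel names against the values I understand the CLI to accept and then prints the command with `CLUSTER` and `RG` as placeholders. Which channel suits which environment is left to the plan.

```bash
# Hypothetical dry-run helper: validate the channel names, then print the command.
#   $1 cluster auto-upgrade channel: none | patch | stable | rapid | node-image
#   $2 node OS upgrade channel:      None | Unmanaged | SecurityPatch | NodeImage
upgrade_channel_cmd() (
  case "$1" in
    none|patch|stable|rapid|node-image) ;;
    *) echo "unknown cluster auto-upgrade channel: $1" >&2; return 1 ;;
  esac
  case "$2" in
    None|Unmanaged|SecurityPatch|NodeImage) ;;
    *) echo "unknown node OS upgrade channel: $2" >&2; return 1 ;;
  esac
  echo "az aks update --name CLUSTER --resource-group RG --auto-upgrade-channel $1 --node-os-upgrade-channel $2"
)

upgrade_channel_cmd stable NodeImage
```

Pair the channels with a planned maintenance window so that automatic upgrades land at a predictable time.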

### Step 10 — Cost visibility
- Enable AKS cost analysis add-on (OpenCost-based):
https://learn.microsoft.com/azure/aks/cost-analysis

Return a final output with:
- recommended config
- Day-0 checklist
- Day-1 checklist
- optional command/IaC skeleton

---

## Guardrails / Safety
- Do not request or output secrets (tokens, keys, subscription IDs).
- If requirements are ambiguous, propose 2–3 safe options with tradeoffs and choose a conservative default.
- Do not promise zero downtime; advise workload safeguards (PDBs, probes, replicas) and staged upgrades.
- If user asks for actions that require privileged access, provide a plan and commands with placeholders.
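
One way to support the first and last guardrails is to scrub GUID-shaped values, the format used by subscription and tenant IDs, from any text before echoing it back. The filter below is a minimal sketch with a hypothetical name, `redact_guids`. It only catches GUIDs, so tokens and keys still need separate handling.

```bash
# Replace anything shaped like a GUID (8-4-4-4-12 hex digits) with a placeholder.
redact_guids() {
  sed -E 's/[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}/<REDACTED_ID>/g'
}

echo "az account set --subscription 00000000-1111-2222-3333-444444444444" | redact_guids
# prints: az account set --subscription <REDACTED_ID>
```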

---

## Quality Bar
A high-quality answer:
- flags Day-0 irreversible choices (networking, API server access),
- includes identity/secrets/policy defaults (Workload ID + Key Vault CSI + safeguards),
- includes observability baseline,
- includes upgrade/patch plan,
- includes cost visibility.

---

## Quick Reference

### Common AKS Commands

| Task | Command |
|------|---------|
| List clusters | `az aks list -o table` |
| Show cluster | `az aks show -n CLUSTER -g RG` |
| Get credentials | `az aks get-credentials -n CLUSTER -g RG` |
| List node pools | `az aks nodepool list --cluster-name CLUSTER -g RG` |
| Scale node pool | `az aks nodepool scale --cluster-name CLUSTER -g RG -n POOL --node-count 5` |

Copilot AI Feb 28, 2026

The Quick Reference section is currently focused on command tables, but the repo’s SKILL.md authoring guideline calls for a summary table of key properties (e.g., MCP tools, CLI commands, best for) to make the skill scannable. Consider adding a short top-level summary table here (and/or moving Quick Reference earlier) to match that expectation.

| Upgrade cluster | `az aks upgrade -n CLUSTER -g RG --kubernetes-version VERSION` |

### kubectl Quick Commands

| Task | Command |
|------|---------|
| Get pods | `kubectl get pods -A` |
| Get nodes | `kubectl get nodes -o wide` |
| Describe pod | `kubectl describe pod POD -n NAMESPACE` |
| View pod logs | `kubectl logs POD -n NAMESPACE --tail=100` |
| Check events | `kubectl get events --sort-by='.lastTimestamp'` |
| Top nodes | `kubectl top nodes` |

---

## Error Handling

### Common AKS Issues

| Issue | Symptom | Resolution |
|-------|---------|------------|
| **Cluster creation fails** | Quota exceeded | Request quota increase or use different VM SKU |
| **Node not ready** | NotReady status | Check kubelet logs, node conditions |
| **Pod pending** | Insufficient resources | Scale node pool or check resource requests |
| **Image pull failed** | ImagePullBackOff | Check ACR access, image name, network |
| **API server unreachable** | Connection refused | Check authorized IPs, private cluster config |

---

## References (Public)
- AKS Automatic overview: https://learn.microsoft.com/azure/aks/intro-aks-automatic
- Azure CNI Overlay (setup and parameters): https://learn.microsoft.com/azure/aks/azure-cni-overlay
- Azure CNI powered by Cilium: https://learn.microsoft.com/azure/aks/azure-cni-powered-by-cilium
- Microsoft Entra Workload ID on AKS: https://learn.microsoft.com/azure/aks/workload-identity-overview
- Key Vault provider for Secrets Store CSI Driver: https://learn.microsoft.com/azure/aks/csi-secrets-store-driver
- Deployment Safeguards: https://learn.microsoft.com/azure/aks/deployment-safeguards
- Enable AKS monitoring (Prometheus + Grafana + logs): https://learn.microsoft.com/azure/azure-monitor/containers/kubernetes-monitoring-enable
- Azure Monitor managed Prometheus overview: https://learn.microsoft.com/azure/azure-monitor/metrics/prometheus-metrics-overview
- AKS patch & upgrade practices (Day-2 guidance): https://learn.microsoft.com/azure/architecture/operator-guides/aks/aks-upgrade-practices
- Node OS auto-upgrade channels: https://learn.microsoft.com/azure/aks/auto-upgrade-node-os-image
- Cluster auto-upgrade channels: https://learn.microsoft.com/azure/aks/auto-upgrade-cluster
- AKS cost analysis (OpenCost-based): https://learn.microsoft.com/azure/aks/cost-analysis