31 commits
b10b649
Create AKS folder and SKILL.md
julia-yin Feb 25, 2026
8ff360f
Add azure-kubernetes to skill.json
julia-yin Feb 25, 2026
cc20821
Update skills.json
julia-yin Feb 25, 2026
769631d
Fix issue of postgres skill missing from skills.json
julia-yin Feb 25, 2026
c37988c
Fix skills.json
julia-yin Feb 25, 2026
0a6efd5
Add AKS to architecture.md and testing for AKS skill
julia-yin Feb 27, 2026
eb5c9a2
Update SKILL.md
julia-yin Feb 28, 2026
31523df
Update plugin/skills/azure-kubernetes/SKILL.md
julia-yin Feb 28, 2026
be258cd
Remove trailing empty lines
julia-yin Feb 28, 2026
295a9ed
Add AKS to integration test schedule
julia-yin Feb 28, 2026
2c1d3a2
Fix pr.yaml creating leading space
julia-yin Feb 28, 2026
7d79459
Update SKILL.md
julia-yin Feb 28, 2026
afc8b05
Update triggers.test.ts.snap
julia-yin Feb 28, 2026
0d6e6ef
Add in missing best practices (ephemeral disk, auto upgrades, reliabi…
julia-yin Mar 2, 2026
2e15f00
Add security best practices
julia-yin Mar 2, 2026
bd1e5e5
Streamline and reduce token count
julia-yin Mar 2, 2026
8d8b187
Add azure-kubernetes to skills.json
julia-yin Mar 2, 2026
cfc9cc0
Fix naming issues
julia-yin Mar 2, 2026
9d12083
Update trigger and unit tests
julia-yin Mar 2, 2026
14540ae
Bump azure-prepare version to 1.0.1
julia-yin Mar 3, 2026
f6fcda1
Fix metadata.version
julia-yin Mar 3, 2026
f6dc996
Add metadata to azure-kubernetes skill
julia-yin Mar 3, 2026
453c478
Apply suggestion from @Copilot
julia-yin Mar 3, 2026
0436b19
Apply suggestion from @Copilot
julia-yin Mar 3, 2026
c65c53f
Bump azure-prepare skill version
julia-yin Mar 4, 2026
1bf5c35
Revert pr.yml
julia-yin Mar 5, 2026
3e578e0
Add license to AKS skill
julia-yin Mar 5, 2026
9144b03
Add back azure-prepare description
julia-yin Mar 5, 2026
3294051
Bump azure-prepare version to 1.0.4
julia-yin Mar 5, 2026
4c6c56a
Add back license to azure-prepare
julia-yin Mar 5, 2026
02fa831
Fix description
julia-yin Mar 5, 2026
128 changes: 128 additions & 0 deletions plugin/skills/azure-kubernetes/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
---
name: azure-kubernetes
license: MIT
metadata:
author: Microsoft
version: "1.0.0"
description: "Plan and create production-ready Azure Kubernetes Service (AKS) clusters. Covers Day-0 decisions and Day-1 configuration, cluster SKUs (Automatic vs Standard), security, monitoring, reliability/performance best practices, upgrades, and networking. WHEN: create AKS cluster, plan AKS configuration, design AKS networking, AKS Automatic vs Standard, AKS security, AKS upgrade strategy, AKS autoscaling, AKS monitoring setup, AKS cost analysis, Day-0 checklist."
---
Comment on lines +1 to +8
Copilot AI Mar 3, 2026

The skill description/WHEN list includes many generic words (e.g., “plan”, “create”, “best”, “practices”, “deploy”). In this repo’s trigger tests, TriggerMatcher adds every description word >3 chars as a keyword and triggers on >=2 matches, which increases the chance of false positives (e.g., unrelated prompts containing “create” + “deploy” + “container”). Consider tightening the description/WHEN phrases to be more AKS-specific so keyword extraction stays discriminative.

Copilot uses AI. Check for mistakes.
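
The false-positive risk raised in this comment can be illustrated with a minimal sketch. It assumes the matching rule exactly as the comment states it (description words longer than 3 characters become keywords, and a prompt triggers on 2 or more matches). The names `tokenize`, `extractKeywords`, and `shouldTrigger` are hypothetical and are not the repo's actual `TriggerMatcher` API, which may tokenize and score differently.

```typescript
// Hypothetical sketch of the rule described in the review comment.
// The repo's real TriggerMatcher may tokenize and score differently.

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9-]+/g) || [];
}

// Every distinct description word longer than 3 characters becomes a keyword.
function extractKeywords(description: string): string[] {
  const keywords: string[] = [];
  for (const word of tokenize(description)) {
    if (word.length > 3 && keywords.indexOf(word) === -1) {
      keywords.push(word);
    }
  }
  return keywords;
}

// A prompt triggers the skill when it contains at least 2 keywords.
function shouldTrigger(prompt: string, keywords: string[]): boolean {
  const promptWords = tokenize(prompt);
  const matches = keywords.filter((k) => promptWords.indexOf(k) !== -1);
  return matches.length >= 2;
}

const keywords = extractKeywords("Plan and create production-ready AKS clusters");
// Generic words alone ("create" + "plan") match an unrelated prompt.
console.log(shouldTrigger("Help me create a plan for my vacation", keywords));
```

Under this simplified rule the 3-character token "aks" is not even a keyword, so the generic words do all the matching. That is the situation the comment asks the description to avoid.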

# Azure Kubernetes Service

> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE**
>
> This skill produces a **recommended AKS cluster configuration** based on user requirements, distinguishing **Day-0 decisions** (networking, API server — hard to change later) from **Day-1 features** (can enable post-creation). See [CLI reference](./references/cli-reference.md) for commands.

## Quick Reference
| Property | Value |
|----------|-------|
| Best for | AKS cluster planning and Day-0 decisions |
| MCP Tools | `mcp_azure_mcp_aks`, `mcp_aks_mcp_az_aks_operations` |
Copilot AI Mar 3, 2026

The MCP tool name mcp_aks_mcp_az_aks_operations doesn’t match the mcp_azure_mcp_* naming used elsewhere in this repo, and it isn’t referenced anywhere else. If this is meant to be an Azure MCP tool, rename it to the correct tool identifier (or remove it) so the skill doesn’t instruct agents to call a non-existent tool.

Suggested change
- | MCP Tools | `mcp_azure_mcp_aks`, `mcp_aks_mcp_az_aks_operations` |
+ | MCP Tools | `mcp_azure_mcp_aks`, `mcp_azure_mcp_az_aks_operations` |

| CLI | `az aks create`, `az aks show` |
Comment on lines +19 to +21
Copilot AI Mar 4, 2026

mcp_aks_mcp_az_aks_operations doesn’t appear anywhere else in the repo and doesn’t match the existing MCP tool naming pattern (mcp_azure_mcp_*). If this tool name is a typo, agents will attempt to call a non-existent tool. Please confirm the actual AKS MCP tool names and update this list accordingly.

| Related skills | azure-diagnostics (troubleshooting), azure-deploy (app deployment) |
Comment on lines +19 to +22
Copilot AI Mar 3, 2026

The MCP tool names in Quick Reference (mcp_azure_mcp_aks, mcp_aks_mcp_az_aks_operations) don’t match the MCP tool naming used elsewhere in this repo (e.g., aks is referenced as the dedicated AKS MCP tool in azure-resource-lookup). These identifiers also aren’t referenced anywhere else in the repo, so they’re likely incorrect. Please align this row with the actual tool name(s) used by the AKS MCP server (or remove the row if tool support isn’t available).


Comment on lines +16 to +23
Copilot AI Mar 5, 2026

Skill authoring guidelines require a dedicated MCP Tools section with a table of available MCP commands and parameters. Currently MCP tools are only listed inline in the Quick Reference table; add the required MCP Tools section/table so consumers can see which tool to call and with what parameters.

## When to Use This Skill
Comment on lines +16 to +24
Copilot AI Mar 5, 2026

This new skill file is missing some required sections per the repo’s SKILL authoring guidelines: there should be a dedicated MCP Tools section (with a table of commands/parameters) and an Error Handling section (table of common errors, messages, remediation). Adding these sections will keep the skill consistent with other Azure skills and improve agent behavior.

Activate this skill when user wants to:
- Create a new AKS cluster
- Plan AKS cluster configuration for production workloads
- Design AKS networking (API server access, pod IP model, egress)
Comment on lines +16 to +28
Copilot AI Mar 2, 2026

This SKILL.md doesn’t follow the repository’s Skill File Authoring Guidelines required section structure (Quick Reference, When to Use This Skill, MCP Tools, Workflow/Steps, Error Handling). Please restructure the document to include those sections/tables so it’s consistent with other plugin skills and easier to scan.

- Set up AKS identity and secrets management
- Configure AKS governance (Azure Policy, Deployment Safeguards)
- Enable AKS observability (monitoring, Prometheus, Grafana)
- Define AKS upgrade and patching strategy
- Enable AKS cost visibility and analysis
- Understand AKS Automatic vs Standard SKU differences
- Get a Day-0 checklist for AKS cluster setup and configuration

## Rules
1. Start with the user's requirements for provisioning compute, networking, security, and other settings.
2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes.
Copilot AI Feb 25, 2026

Rule 2 refers to an "AKS MCP server", but repo MCP config only defines a generic azure MCP server (plugin/.mcp.json). This will lead agents to look for a non-existent server; please update the rule to reference the Azure MCP server and the relevant AKS-related MCP tools (or CLI) explicitly.

Suggested change
- 2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes.
+ 2. Use the `azure` MCP server and its AKS-related MCP tools to invoke Azure APIs and perform AKS and kubectl operations whenever possible during cluster setup and ongoing operations; if required functionality is not available via MCP tools, fall back to Azure CLI and kubectl commands.

3. Determine if AKS Automatic or Standard SKU is more appropriate based on the user's need for control vs convenience. Default to AKS Automatic unless specific customizations are required.
4. Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access).
Comment on lines +24 to +41
Copilot AI Mar 4, 2026

The skill authoring guidelines require an explicit MCP Tools section with a table of available commands + parameters. This skill only lists MCP tools in Quick Reference, which makes it hard for agents to know what to call and with which parameters. Add a dedicated ## MCP Tools section (table) before the workflow, consistent with other skills.


Comment on lines +16 to +42
Copilot AI Mar 3, 2026

Missing required MCP Tools section: the repo’s skill authoring guidelines require an explicit “MCP Tools” section with a table of commands/parameters (not just listing tool names in Quick Reference). See .github/instructions/skill-files.instructions.md (Required Sections #3).


## Required Inputs (Ask only what’s needed)
If the user is unsure, use safe defaults.
- Cluster environment: dev/test or production
- Region(s), availability zones, preferred node VM sizes
- Expected scale (node/cluster count, workload size)
- Networking requirements (API server access, pod IP model, ingress/egress control)
- Security and identity requirements, including image registry
- Upgrade and observability preferences
- Cost constraints
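
The "use safe defaults" instruction above can be sketched as a small merge step. This is an illustration only: the `AksInputs` shape, its field names, and the default values are assumptions drawn from recommendations elsewhere in this skill (AKS Automatic, Azure CNI Overlay, three availability zones), not a schema the skill defines.

```typescript
// Hypothetical input shape; field names are illustrative, not defined by the skill.
interface AksInputs {
  environment: "dev/test" | "production";
  region?: string; // no safe default: must come from the user
  zones: string[];
  sku: "automatic" | "standard";
  podIpModel: "overlay" | "vnet-routable";
}

// Assumed defaults, mirroring this skill's own recommendations.
const SAFE_DEFAULTS: AksInputs = {
  environment: "production",
  zones: ["1", "2", "3"],
  sku: "automatic",
  podIpModel: "overlay",
};

// Answers the user actually gave override the defaults; everything else falls back.
function withDefaults(answers: Partial<AksInputs>): AksInputs {
  return { ...SAFE_DEFAULTS, ...answers };
}
```

For example, `withDefaults({ region: "eastus" })` keeps the assumed Automatic SKU and overlay networking while recording the one value the user supplied.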

## Workflow

### 1. Cluster Type
- **AKS Automatic** (default): Best for most production workloads. Provides a curated experience with preconfigured best practices for security, reliability, and performance. Use it unless you have specific custom requirements for networking, autoscaling, or node pool configurations not supported by NAP.
- **AKS Standard**: Use if you need full control over cluster configuration. Expect additional overhead to set up and manage.
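
The default-to-Automatic rule above, combined with Rule 4 (document the rationale), might be sketched as follows. The `CustomizationNeeds` fields and the `chooseSku` helper are hypothetical names introduced for illustration; the skill itself does not define them.

```typescript
// Hypothetical flags for the three customization areas named in the text.
interface CustomizationNeeds {
  customNetworking: boolean;
  customAutoscaling: boolean;
  customNodePools: boolean;
}

interface SkuDecision {
  sku: "automatic" | "standard";
  rationale: string;
}

// Default to Automatic; choose Standard only when a specific customization is required.
function chooseSku(needs: CustomizationNeeds): SkuDecision {
  const reasons: string[] = [];
  if (needs.customNetworking) reasons.push("custom networking");
  if (needs.customAutoscaling) reasons.push("custom autoscaling");
  if (needs.customNodePools) reasons.push("custom node pool configuration");

  if (reasons.length === 0) {
    return { sku: "automatic", rationale: "No custom requirements; using the default." };
  }
  return { sku: "standard", rationale: "Requires " + reasons.join(", ") + "." };
}
```

Returning the rationale with the SKU keeps the recorded decision next to the reason for it, which matters most for choices that are hard to change later.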

### 2. Networking (Pod IP, Egress, Ingress, Dataplane)

**Pod IP Model** (Key Day-0 decision):
- **Azure CNI Overlay** (recommended): pod IPs come from a private overlay range and are not VNet-routable; scales to large clusters and suits most workloads
- **Azure CNI (VNet-routable)**: pod IPs directly from VNet (pod subnet or node subnet), use when pods must be directly addressable from VNet or on-prem
- Docs: https://learn.microsoft.com/azure/aks/azure-cni-overlay

**Dataplane & Network Policy**:
- **Azure CNI powered by Cilium** (recommended): eBPF-based for high-performance packet processing, network policies, and observability

**Egress**:
- **Static Egress Gateway** for stable, predictable outbound IPs
- For restricted egress: UDR + Azure Firewall or NVA

**Ingress**:
- **App Routing addon with Gateway API** — recommended default for HTTP/HTTPS workloads
- **Istio service mesh with Gateway API** — for advanced traffic management, mTLS, canary deployments
- **Application Gateway for Containers** — for L7 load balancing with WAF integration

**DNS**:
- Enable **LocalDNS** on all node pools for reliable, performant DNS resolution

### 3. Security
- Use **Microsoft Entra ID** everywhere (control plane, Workload Identity for pods, node access). Avoid static credentials.
- Azure Key Vault via **Secrets Store CSI Driver** for secrets
- Enable **Azure Policy** + **Deployment Safeguards**
- Enable **Encryption at rest** for etcd/API server; **in-transit** for node-to-node
- Allow only signed, policy-approved images (Azure Policy + Ratify), prefer **Azure Container Registry**
- **Isolation**: Use namespaces, network policies, scoped logging

### 4. Observability
- Use Azure Monitor and Container Insights for AKS monitoring enablement (logs + Prometheus + Grafana).

### 5. Upgrades & Patching
- Configure **Maintenance Windows** for controlled upgrade timing
- Enable **auto-upgrades** for cluster and node OS to stay up-to-date with security patches and Kubernetes versions
- Consider **LTS versions** for enterprise stability (2-year support); this requires upgrading your cluster to the AKS Premium tier
- **Multi-cluster upgrades**: Use **AKS Fleet Manager** for staged rollout across test → production clusters

### 6. Performance
- Use **Ephemeral OS disks** (`--node-osdisk-type Ephemeral`) for faster node startup
- Select **Azure Linux** as node OS (smaller footprint, faster boot)
- Enable **KEDA** for event-driven autoscaling beyond HPA

### 7. Node Pools & Compute
- **Dedicated system node pool**: At least 2 nodes, tainted for system workloads only (`CriticalAddonsOnly`)
- Enable **Node Auto Provisioning (NAP)** on all pools for cost savings and responsive scaling
- Use **latest generation SKUs (v5/v6)** for host-level optimizations
- **Avoid B-series VMs** — burstable SKUs cause performance/reliability issues
- Use SKUs with **at least 4 vCPUs** for production workloads
- Set **topology spread constraints** to distribute pods across hosts/zones per SLO

### 8. Reliability
- Deploy across **3 Availability Zones** (`--zones 1 2 3`)
- Use **Standard tier** for zone-redundant control plane + 99.95% SLA for API server availability
- Enable **Microsoft Defender for Containers** for runtime protection
- Configure **PodDisruptionBudgets** for all production workloads
- Use **topology spread constraints** to ensure pod distribution across failure domains
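
The PodDisruptionBudget and topology spread recommendations above correspond to standard Kubernetes objects. The sketch below builds both as typed objects that could be serialized into a manifest. The interfaces are hand-written and cover only the fields used here, and the `app` label key, the helper names, and the `maxSkew` of 1 are assumptions chosen for illustration.

```typescript
// Minimal hand-written types covering only the fields used in this sketch.
interface LabelSelector {
  matchLabels: { [key: string]: string };
}

interface PodDisruptionBudget {
  apiVersion: "policy/v1";
  kind: "PodDisruptionBudget";
  metadata: { name: string };
  spec: { minAvailable: number; selector: LabelSelector };
}

interface TopologySpreadConstraint {
  maxSkew: number;
  topologyKey: string;
  whenUnsatisfiable: "DoNotSchedule" | "ScheduleAnyway";
  labelSelector: LabelSelector;
}

// Keep at least `minAvailable` pods of the app running during voluntary disruptions.
function pdbFor(app: string, minAvailable: number): PodDisruptionBudget {
  return {
    apiVersion: "policy/v1",
    kind: "PodDisruptionBudget",
    metadata: { name: app + "-pdb" },
    spec: { minAvailable: minAvailable, selector: { matchLabels: { app: app } } },
  };
}

// Spread the app's pods evenly across availability zones (a pod spec fragment).
function zoneSpreadFor(app: string): TopologySpreadConstraint {
  return {
    maxSkew: 1,
    topologyKey: "topology.kubernetes.io/zone",
    whenUnsatisfiable: "DoNotSchedule",
    labelSelector: { matchLabels: { app: app } },
  };
}
```

`DoNotSchedule` is the stricter choice; `ScheduleAnyway` trades even distribution for schedulability, so which one fits depends on the workload's SLO.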

### 9. Cost Controls
- Use **Spot node pools** for batch/interruptible workloads (up to 90% savings)
- **Stop/Start** dev/test clusters: `az aks stop` and `az aks start`
- Consider **Reserved Instances** or **Savings Plans** for steady-state workloads

## Guardrails / Safety
- Do not request or output secrets (tokens, keys, subscription IDs).
- If requirements are ambiguous for critical Day-0 decisions, ask the user clarifying questions. For features that can be enabled on Day-1, propose 2–3 safe options with tradeoffs and choose a conservative default.
- Do not promise zero downtime; advise workload safeguards (PDBs, probes, replicas) and staged upgrades along with best practices for reliability and performance.
- If user asks for actions that require privileged access, provide a plan and commands with placeholders.
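
One way to honor the first guardrail is to scrub identifiers from any text before it is shown. The sketch below replaces GUID-shaped values (the format Azure subscription IDs use) with a placeholder. The `redactGuids` name and the `<redacted-guid>` placeholder are hypothetical, and a GUID pattern alone does not catch tokens or keys, so this covers only part of the guardrail.

```typescript
// Matches GUID-shaped values such as Azure subscription or tenant IDs.
const GUID_PATTERN = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;

// Replace every GUID in the text with a placeholder before output.
function redactGuids(text: string): string {
  return text.replace(GUID_PATTERN, "<redacted-guid>");
}

const resourceId =
  "/subscriptions/12345678-1234-1234-1234-123456789abc/resourceGroups/my-rg";
console.log(redactGuids(resourceId));
```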
Comment on lines +119 to +128
Copilot AI Mar 4, 2026

The skill authoring guidelines call for an Error Handling section (table of errors/messages/remediation). This SKILL.md currently ends with Guardrails/Safety but never defines expected failure modes (e.g., missing permissions, quota, unsupported SKU/region features) or what the agent should do. Add a ## Error Handling section with a small table.

Comment on lines +124 to +128
Copilot AI Mar 5, 2026

Skill authoring guidelines require an Error Handling section with a table of errors/messages/remediation. This skill currently ends with Guardrails/Safety but does not define error cases or remediation steps; add the required Error Handling table.

33 changes: 33 additions & 0 deletions plugin/skills/azure-kubernetes/references/cli-reference.md
@@ -0,0 +1,33 @@
# CLI Reference for AKS

```bash
# List AKS clusters
az aks list --output table

# Show cluster details
az aks show --name <cluster-name> --resource-group <resource-group>

# Get available Kubernetes versions
az aks get-versions --location <location> --output table

# Create AKS Automatic cluster
az aks create --name <cluster-name> --resource-group <resource-group> --sku automatic \
--network-plugin azure --network-plugin-mode overlay \
--enable-oidc-issuer --enable-workload-identity

# Create AKS Standard cluster
az aks create --name <cluster-name> --resource-group <resource-group> \
--node-count 3 --zones 1 2 3 \
--network-plugin azure --network-plugin-mode overlay \
--enable-cluster-autoscaler --min-count 1 --max-count 10

# Get credentials
az aks get-credentials --name <cluster-name> --resource-group <resource-group>

# List node pools
az aks nodepool list --cluster-name <cluster-name> --resource-group <resource-group> --output table

# Enable monitoring
az aks enable-addons --name <cluster-name> --resource-group <resource-group> \
--addons monitoring --workspace-resource-id <workspace-resource-id>
```
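
An agent following this reference might assemble the `az aks create` command from a chosen configuration instead of copying it by hand. The sketch below is one possible shape. It uses only flags that appear in the commands above and keeps the `<cluster-name>` and `<resource-group>` placeholders. The `CreateOptions` type and `buildCreateCommand` helper are hypothetical.

```typescript
// Hypothetical options; only flags shown in the CLI reference are emitted.
interface CreateOptions {
  sku: "automatic" | "standard";
  nodeCount?: number; // used for Standard clusters only
  zones?: string[]; // used for Standard clusters only
}

function buildCreateCommand(opts: CreateOptions): string {
  const args: string[] = [
    "az", "aks", "create",
    "--name", "<cluster-name>",
    "--resource-group", "<resource-group>",
    "--network-plugin", "azure",
    "--network-plugin-mode", "overlay",
  ];
  if (opts.sku === "automatic") {
    args.push("--sku", "automatic", "--enable-oidc-issuer", "--enable-workload-identity");
  } else {
    if (opts.nodeCount !== undefined) {
      args.push("--node-count", String(opts.nodeCount));
    }
    if (opts.zones && opts.zones.length > 0) {
      args.push("--zones", ...opts.zones);
    }
  }
  return args.join(" ");
}
```

Keeping the placeholders in the generated string matches the guardrail about providing commands with placeholders when privileged access is needed.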
2 changes: 1 addition & 1 deletion plugin/skills/azure-prepare/SKILL.md
@@ -4,7 +4,7 @@ description: "Prepare Azure apps for deployment (infra Bicep/Terraform, azure.ya
license: MIT
metadata:
author: Microsoft
- version: "1.0.3"
+ version: "1.0.4"
---

# Azure Prepare
38 changes: 33 additions & 5 deletions plugin/skills/azure-prepare/references/architecture.md
@@ -22,18 +22,46 @@ Select hosting stack and map components to Azure services.
| Workflow / orchestration | | ✓✓ (Durable Functions + DTS) | |
| Minimal ops overhead | | ✓✓ | ✓ |

### Container Hosting: Container Apps vs AKS

| Factor | Container Apps | AKS |
|--------|:--------------:|:---:|
| **Scale to zero** | ✓✓ | |
| **Kubernetes API access** | | ✓✓ |
| **Custom operators/CRDs** | | ✓✓ |
| **Service mesh** | Dapr (built-in) | Istio, Cilium |
| **GPU workloads** | | ✓✓ |
| **Best for** | Microservices, event-driven | Full K8s control, complex workloads |

#### When to Use Container Apps
- Microservices without Kubernetes complexity
- Event-driven workloads (KEDA built-in)
- Need scale-to-zero for cost optimization
- Teams without Kubernetes expertise

#### When to Use AKS
- Need Kubernetes API/kubectl access
- Require custom operators or CRDs
- Service mesh requirements (Istio, Linkerd)
- GPU/ML workloads
- Complex networking or multi-tenant architectures
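
The two lists above amount to a simple rule: any AKS-specific need selects AKS, otherwise Container Apps is the lighter option. A minimal sketch of that rule follows. The `HostingNeeds` field names and the `chooseContainerHost` helper are assumptions made for illustration, and real decisions will usually weigh more factors than these flags.

```typescript
// Hypothetical flags mirroring the "When to Use AKS" list.
interface HostingNeeds {
  kubernetesApiAccess: boolean;
  customOperatorsOrCrds: boolean;
  serviceMesh: boolean;
  gpuWorkloads: boolean;
  complexNetworkingOrMultiTenant: boolean;
}

type ContainerHost = "AKS" | "Container Apps";

// Any single AKS-specific requirement is enough to select AKS.
function chooseContainerHost(needs: HostingNeeds): ContainerHost {
  const needsAks =
    needs.kubernetesApiAccess ||
    needs.customOperatorsOrCrds ||
    needs.serviceMesh ||
    needs.gpuWorkloads ||
    needs.complexNetworkingOrMultiTenant;
  return needsAks ? "AKS" : "Container Apps";
}
```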

> **AKS Planning:** For AKS SKU selection (Automatic vs Standard), networking, identity, scaling, and security configuration, invoke the **azure-kubernetes** skill.

## Service Mapping

### Hosting

| Component Type | Primary Service | Alternatives |
|----------------|-----------------|--------------|
| SPA Frontend | Static Web Apps | Blob + CDN |
- | SSR Web App | Container Apps | App Service |
- | REST/GraphQL API | Container Apps | App Service, Functions |
- | Background Worker | Container Apps | Functions |
- | Scheduled Task | Functions (Timer) | Container Apps Jobs |
- | Event Processor | Functions | Container Apps |
+ | SSR Web App | Container Apps | App Service, AKS |
+ | REST/GraphQL API | Container Apps | App Service, Functions, AKS |
+ | Background Worker | Container Apps | Functions, AKS |
+ | Scheduled Task | Functions (Timer) | Container Apps Jobs, Kubernetes CronJob (on AKS) |
+ | Event Processor | Functions | Container Apps, AKS + KEDA |
+ | Microservices (full K8s) | AKS | Container Apps |
+ | GPU/ML Workloads | AKS | Azure ML |

### Data

103 changes: 103 additions & 0 deletions tests/azure-kubernetes/__snapshots__/triggers.test.ts.snap
@@ -0,0 +1,103 @@
// Jest Snapshot v1, https://goo.gl/fbAQLP

exports[`azure-kubernetes - Trigger Tests Trigger Keywords Snapshot skill description triggers match snapshot 1`] = `
{
"description": "Plan and create production-ready Azure Kubernetes Service (AKS) clusters. Covers Day-0 decisions and Day-1 configuration, cluster SKUs (Automatic vs Standard), security, monitoring, reliability/performance best practices, upgrades, and networking. WHEN: create AKS cluster, plan AKS configuration, design AKS networking, AKS Automatic vs Standard, AKS security, AKS upgrade strategy, AKS autoscaling, AKS monitoring setup, AKS cost analysis, Day-0 checklist.",
"extractedKeywords": [
"aks",
"analysis",
"automatic",
"autoscaling",
"azure",
"best",
"checklist",
"cli",
"cluster",
"clusters",
"configuration",
"container",
"cost",
"covers",
"create",
"day-0",
"day-1",
"decisions",
"deploy",
"design",
"diagnostic",
"entra",
"identity",
"key vault",
"kubernetes",
"mcp",
"monitor",
"monitoring",
"networking",
"observability",
"performance",
"plan",
"practices",
"production-ready",
"reliability",
"security",
"service",
"setup",
"skus",
"standard",
"strategy",
"upgrade",
"upgrades",
"when",
],
"name": "azure-kubernetes",
}
`;

exports[`azure-kubernetes - Trigger Tests Trigger Keywords Snapshot skill keywords match snapshot 1`] = `
[
"aks",
"analysis",
"automatic",
"autoscaling",
"azure",
"best",
"checklist",
"cli",
"cluster",
"clusters",
"configuration",
"container",
"cost",
"covers",
"create",
"day-0",
"day-1",
"decisions",
"deploy",
"design",
"diagnostic",
"entra",
"identity",
"key vault",
"kubernetes",
"mcp",
"monitor",
"monitoring",
"networking",
"observability",
"performance",
"plan",
"practices",
"production-ready",
"reliability",
"security",
"service",
"setup",
"skus",
"standard",
"strategy",
"upgrade",
"upgrades",
"when",
]
`;