diff --git a/plugins/sagemaker-ai/.claude-plugin/plugin.json b/plugins/sagemaker-ai/.claude-plugin/plugin.json
Lines changed: 1 addition & 1 deletion
diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md
Lines changed: 267 additions & 0 deletions
diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md
Lines changed: 238 additions & 0 deletions
diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md
Lines changed: 148 additions & 0 deletions
@@ -18,5 +18,5 @@
   "license": "Apache-2.0",
   "name": "sagemaker-ai",
   "repository": "https://github.com/awslabs/agent-plugins",
-  "version": "1.1.0"
+  "version": "1.1.1"
 }
@@ -0,0 +1,238 @@
+# Capacity Planning for HyperPod Clusters
+
+Deep-dive companion to the main [SKILL.md](../SKILL.md) § B (Capacity & AZ) and the `--validate` pre-create mode. Capacity errors are one of the most common cluster-creation failures. This reference covers how to choose the right capacity strategy, verify availability, and resolve capacity-related failures.
+
+---
+
+## Capacity Acquisition Options
+
+### 1. On-Demand Instances
+
+**Best for:** Small instance types, short-term experiments, development clusters.
+
+- No upfront commitment
+- Available immediately for common types (g5, p3)
+- **Not guaranteed** for large GPU types (p4d, p5, p5e, trn1, trn2)
+- Instances may not be allocated in physical proximity → suboptimal network topology for distributed training
+- Higher hourly cost
+
+```bash
+# Check where an instance type is available (the EC2 API takes the type without the "ml." prefix):
+aws ec2 describe-instance-type-offerings \
+  --location-type availability-zone \
+  --filters "Name=instance-type,Values=p5.48xlarge" \
+  --region us-west-2 \
+  --query 'InstanceTypeOfferings[*].Location' --output table
+```
+
+### 2. Flexible Training Plans
+
+**Best for:** Medium to large workloads with predictable schedules.
+
+Query available capacity by instance type, count, and desired schedule. AWS returns available options with pricing.
+
+```bash
+# List active training plans:
+aws sagemaker list-training-plans \
+  --filters Name=Status,Value=Active \
+  --region <REGION> \
+  --query 'TrainingPlanSummaries[*].{Name:TrainingPlanName,Type:ReservedCapacitySummaries[0].InstanceType,Count:TotalInstanceCount,AZ:ReservedCapacitySummaries[0].AvailabilityZone,Status:Status,Start:StartTime,End:EndTime}' \
+  --output table
+```
+
+**Using with HyperPod:**
+
+```bash
+aws sagemaker create-cluster \
+  --cluster-name my-cluster \
+  --instance-groups '[{
+    "InstanceGroupName": "gpu-workers",
+    "InstanceType": "ml.p5.48xlarge",
+    "InstanceCount": 4,
+    "ExecutionRole": "arn:aws:iam::<ACCT>:role/HyperPodRole",
+    "TrainingPlanArn": "arn:aws:sagemaker:<REGION>:<ACCT>:training-plan/<PLAN_NAME>",
+    "LifeCycleConfig": {
+      "SourceS3Uri": "s3://sagemaker-lifecycle-<guid>/",
+      "OnCreate": "on_create.sh"
+    }
+  }]' \
+  --vpc-config '{"SecurityGroupIds":["sg-xxx"],"Subnets":["subnet-xxx"]}' \
+  --region <REGION>
+```
+
+**Critical:** The subnet must be in the **same AZ** as the training plan's `AvailabilityZone`.
+
+**Training Plan Status Values:** `Pending`, `Active`, `Scheduled`, `Expired`, `Failed`
+
+**Advantages:**
+
+- Guaranteed capacity for reserved period
+- Discounted pricing vs on-demand
+- Better network topology (co-located instances)
+
+**Disadvantages:**
+
+- Requires advance planning and commitment
+- Capacity locked to specific AZ
+
+### 3. Reserved Capacity (ODCR via AWS Account Team)
+
+**Best for:** Large-scale, long-term capacity needs (months+).
+
+- Contact your AWS account team or TAM
+- Best pricing for sustained usage
+- Guaranteed placement in specific AZ
+- Requires longer lead time
+
+**Verification:**
+
+```bash
+# Check reserved capacity details:
+aws sagemaker list-training-plans \
+  --region <REGION> \
+  --query 'TrainingPlanSummaries[?ReservedCapacitySummaries]'
+```
+
+**ReservedCapacitySummary fields:**
+
+- `ReservedCapacityArn`, `ReservedCapacityType` (UltraServer or Instance)
+- `InstanceType`, `TotalInstanceCount`, `AvailabilityZone`
+- `DurationHours`, `DurationMinutes`, `StartTime`, `EndTime`, `Status`
+
+---
+
+## AZ Selection Strategy
+
+### The Problem
+
+Instance type availability varies by AZ. A subnet in `us-west-2a` may have capacity while one in `us-west-2c` does not. Worse, an AZ name such as `us-west-2a` can map to a different physical zone in each AWS account.
+
+### Use AZ IDs for Consistency
+
+AZ IDs (e.g., `usw2-az1`) are consistent across accounts:
+
+```bash
+# Map AZ names to IDs:
+aws ec2 describe-availability-zones --region <REGION> \
+  --query 'AvailabilityZones[*].{Name:ZoneName,ID:ZoneId,State:State}' --output table
+```
+
+When coordinating with AWS Support or account teams about reserved capacity, always use **AZ IDs** (not names).
+
+### Verify Subnet Matches Capacity AZ
+
+```bash
+# Your subnet's AZ:
+aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \
+  --query 'Subnets[0].{AZ:AvailabilityZone,AZ_ID:AvailabilityZoneId}'
+
+# Instance type availability per AZ ID (for <TYPE>, use the EC2 name without the "ml." prefix, e.g. p5.48xlarge):
+aws ec2 describe-instance-type-offerings \
+  --location-type availability-zone-id \
+  --filters "Name=instance-type,Values=<TYPE>" \
+  --region <REGION> \
+  --query 'InstanceTypeOfferings[*].Location'
+```
+
+If your subnet's AZ ID (`AZ_ID` in the first command) doesn't appear in the instance type offerings list, create a new subnet in an AZ that does.
+
+---
+
+## Subnet IP Capacity
+
+GPU instances consume many network interfaces (and IPs) per instance:
+
+| Instance Type    | ENIs | IPs per ENI              | Total IPs (Slurm) | Total IPs (EKS) |
+| ---------------- | ---- | ------------------------ | ----------------- | --------------- |
+| ml.p5.48xlarge   | 32   | 1 primary + 49 secondary | ~32               | ~81             |
+| ml.p5e.48xlarge  | 32   | same                     | ~32               | ~81             |
+| ml.p4d.24xlarge  | 4    | 1 primary + 49 secondary | ~4                | ~51             |
+| ml.p4de.24xlarge | 4    | same                     | ~4                | ~51             |
+| ml.trn1.32xlarge | 8    | 1 primary + 49 secondary | ~8                | ~57             |
+| ml.trn2.48xlarge | 16   | same                     | ~16               | ~65             |
+| ml.g5.48xlarge   | 2    | 1 primary + 14 secondary | ~2                | ~15             |
+
+### Calculate Required IPs
+
+```
+Required IPs = Instance Count × IPs per Instance
+```
+
+For example: 16 × ml.p5.48xlarge on EKS = 16 × 81 = 1,296 IPs → requires at least a /21 subnet (2,048 IPs).
+
+### Recommended Subnet Sizes
+
+| Cluster Size (p5) | Orchestrator | Min Subnet CIDR |
+| ----------------- | ------------ | --------------- |
+| 4 instances       | Slurm        | /24 (256 IPs)   |
+| 4 instances       | EKS          | /23 (512 IPs)   |
+| 16 instances      | Slurm        | /22 (1,024 IPs) |
+| 16 instances      | EKS          | /21 (2,048 IPs) |
+| 64 instances      | Slurm        | /20 (4,096 IPs) |
+| 64 instances      | EKS          | /19 (8,192 IPs) |
+
+Sizes assume ~32 IPs (Slurm) or ~81 IPs (EKS) per p5 instance and account for the 5 addresses AWS reserves in every subnet, so a block whose raw size only equals the requirement (for example, /25 for 4 × 32 = 128 IPs) is too small.
+
+**Subnet CIDRs cannot be changed after creation.** Plan for growth.
+
+```bash
+# Check current availability:
+aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \
+  --query 'Subnets[0].{CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}'
+```
+
+---
+
+## Service Quotas
+
+Check these **before** creating a cluster:
+
+```bash
+# List SageMaker quotas (search for "cluster"):
+aws service-quotas list-service-quotas \
+  --service-code sagemaker --region <REGION> \
+  --query 'Quotas[?contains(QuotaName,`cluster`) || contains(QuotaName,`Cluster`)].{Name:QuotaName,Value:Value,Code:QuotaCode}' \
+  --output table
+```
+
+| Quota                            | Default          | What Happens If Exceeded                |
+| -------------------------------- | ---------------- | --------------------------------------- |
+| `ml.<type> for cluster usage`    | Varies           | `CreateCluster` fails with quota error  |
+| Max instances per cluster        | Account-specific | Cannot add more instance groups         |
+| Total instances across clusters  | Account-specific | Must delete existing clusters first     |
+| Max EBS volume size per instance | 16,384 GB        | `CreateCluster` fails if config exceeds |
+| VPCs per region                  | 5                | CFN VPC creation fails                  |
+| Network interfaces per region    | 5,000            | Instance provisioning fails silently    |
+| Elastic IPs per region           | 5                | NAT Gateway creation fails              |
+
+**Request quota increases proactively** — increases can take 1-3 business days.
+
+---
+
+## Troubleshooting Capacity Failures
+
+### "Insufficient capacity" Error
+
+1. Check which AZs have the instance type available (see commands above)
+2. Verify your subnet is in one of those AZs
+3. If no AZ has capacity: try a different region, instance type, or contact account team
+4. If using Training Plan: verify `TrainingPlanArn` and subnet AZ match
+
+### "No subnets in the capacity AZ" Error
+
+The cluster configuration specifies subnets, but none of them are in the AZ where AWS has capacity.
+
+Fix: Create a new subnet in the AZ where capacity exists and add it to the cluster configuration.
+
+### Cluster Stuck in "Creating" (No Progress)
+
+1. Check `list-cluster-events` for error messages
+2. If no events: likely waiting for capacity
+3. If events show failures: fix the indicated issue
+4. If stuck >1 hour with no events: contact AWS Support
+
+### Partial Provisioning (Some Nodes Running, Others Failing)
+
+This typically means capacity was available for some instances but not all.
+
+- The cluster will keep retrying if `NodeProvisioningMode=Continuous`
+- Check events for the specific instance group that's failing
+- Consider reducing `InstanceCount` or using `MinInstanceCount` for elastic scaling
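
A minimal sketch of the sizing rule from the Subnet IP Capacity section above (Required IPs = Instance Count × IPs per Instance), written as POSIX shell so it can sit next to the other commands. The helper name `min_subnet_prefix` is hypothetical. The sketch assumes a single IPv4 subnet between /28 and /16 and that AWS reserves 5 addresses in every subnet, which is why a block whose raw size merely equals the requirement is still too small.

```bash
#!/bin/sh
# min_subnet_prefix INSTANCE_COUNT IPS_PER_INSTANCE   (hypothetical helper)
# Prints the longest prefix length (smallest subnet) whose usable address
# count covers the cluster. Assumption: 5 addresses are reserved per subnet.
min_subnet_prefix() {
  required=$(( $1 * $2 ))
  prefix=28                                   # smallest IPv4 subnet a VPC allows
  while [ "$prefix" -ge 16 ]; do              # largest IPv4 subnet a VPC allows
    usable=$(( (1 << (32 - prefix)) - 5 ))    # block size minus reserved addresses
    if [ "$usable" -ge "$required" ]; then
      echo "$prefix"
      return 0
    fi
    prefix=$(( prefix - 1 ))
  done
  echo "no single subnet is large enough for $required IPs" >&2
  return 1
}

# Worked example from the text: 16 x ml.p5.48xlarge on EKS at ~81 IPs each.
min_subnet_prefix 16 81   # prints 21
```

The per-instance IP figures are the approximate values from the table above; substitute your own counts if your CNI or orchestrator settings differ.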
@@ -0,0 +1,148 @@
+# CloudFormation Error Reference for HyperPod Deployments
+
+Deep-dive companion to the main [SKILL.md](../SKILL.md) § H (CloudFormation Errors). When deploying HyperPod via the SageMaker console or CloudFormation templates, failures surface as `CREATE_FAILED` or `ROLLBACK_COMPLETE` at the top-level stack. The actual root cause is usually buried several levels deep in nested stacks.
+
+---
+
+## Navigating Nested Stacks
+
+### Stack Hierarchy (Console Deployments)
+
+A typical HyperPod console deployment creates this stack structure:
+
+```
+Top-Level Stack (HyperPod-<name>)
+├── NetworkStack (VPC, subnets, IGW, NAT, SG, S3 endpoint)
+├── StorageStack (FSx Lustre, optional OpenZFS)
+├── IAMStack (execution role, instance profile)
+├── S3Stack (lifecycle scripts bucket + upload)
+└── ClusterStack (AWS::SageMaker::Cluster resource)
+    └── [The cluster resource itself — most failures end here]
+```
+
+### Step-by-Step Navigation
+
+1. **CloudFormation Console** → ensure correct region → find the HyperPod stack
+2. **Status filter:** look for `CREATE_FAILED` or `ROLLBACK_COMPLETE`
+3. **Events tab** → filter by `CREATE_FAILED` → note the earliest failure timestamp
+4. **Resources tab** → find `AWS::CloudFormation::Stack` type entries with `CREATE_FAILED`
+5. **Click Physical ID** of the failed nested stack
+6. **Repeat** until reaching a stack with only leaf resources (no further `AWS::CloudFormation::Stack`)
+7. **Read Status Reason** on the failed leaf resource — this is the root cause
+
+### Tip: Find Root Cause via CLI
+
+```bash
+# List failed events in the top-level stack (nested-stack events are not included; requires stack name or ID):
+aws cloudformation describe-stack-events \
+  --stack-name <TOP_LEVEL_STACK_NAME> \
+  --region <REGION> \
+  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].{Time:Timestamp,Resource:LogicalResourceId,Type:ResourceType,Reason:ResourceStatusReason}' \
+  --output table
+
+# For nested stacks, pass the nested stack's physical ID (shown in the Resources tab):
+aws cloudformation describe-stack-events \
+  --stack-name <NESTED_STACK_ID> \
+  --region <REGION> \
+  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'
+```
+
+---
+
+## Resource Error Catalog
+
+### AWS::SageMaker::Cluster
+
+| Status Reason                                               | Root Cause                                         | Fix                                                                          |
+| ----------------------------------------------------------- | -------------------------------------------------- | ---------------------------------------------------------------------------- |
+| `Insufficient capacity in the Availability Zone`            | No on-demand instances available in AZ             | Add subnet in different AZ; use Flexible Training Plans or reserved capacity |
+| `No subnets in the capacity AZ`                             | Cluster subnet not in the AZ where capacity exists | Create subnet in the AZ where instances are available                        |
+| `EFA health checks did not run successfully`                | Security group missing self-referencing rules      | Add inbound + outbound self-ref rules on SG (protocol: All, source: self)    |
+| `Lifecycle scripts did not run successfully`                | Script error, S3 access, or timeout                | Check CloudWatch logs: `/aws/sagemaker/Clusters/<name>/<id>`                 |
+| `Instance bootstrap failed due to network misconfiguration` | VPC routing or SG issue                            | Verify NAT Gateway route, S3 VPC endpoint, SG rules                          |
+| `The security group 'sg-xxx' does not exist`                | SG ID is wrong or in different region              | Verify SG exists in the same region and VPC                                  |
+| `The subnet 'subnet-xxx' does not exist`                    | Subnet ID is wrong or in different region          | Verify subnet exists in the same region                                      |
+| `You are not authorized to perform this operation`          | Execution role missing permissions                 | Add `AmazonSageMakerClusterInstanceRolePolicy` + VPC permissions             |
+| `The maximum number of instances ... has been reached`      | Service quota exceeded                             | Request quota increase via Service Quotas console                            |
+
+### AWS::IAM::Role
+
+| Status Reason                             | Root Cause                            | Fix                                                        |
+| ----------------------------------------- | ------------------------------------- | ---------------------------------------------------------- |
+| `Cannot exceed quota for PoliciesPerRole` | Too many managed policies attached    | Consolidate managed policies; default limit is 10 per role |
+| `Invalid principal in policy`             | Trust policy references wrong service | Use `"Service": "sagemaker.amazonaws.com"` in trust policy |
+| `MalformedPolicyDocument`                 | JSON syntax error in inline policy    | Validate JSON; check for trailing commas, missing quotes   |
+| `EntityAlreadyExists`                     | Role name already taken               | Use unique name or import existing role                    |
+
+### AWS::EC2::VPC / Subnet / SecurityGroup
+
+| Status Reason                                        | Root Cause                                | Fix                                                          |
+| ---------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------------ |
+| `The CIDR 'x.x.x.x/y' conflicts with another subnet` | Overlapping CIDR in same VPC              | Use non-overlapping CIDR blocks                              |
+| `The maximum number of VPCs has been reached`        | VPC quota per region (default: 5)         | Request VPC quota increase                                   |
+| `InvalidPermission.Duplicate`                        | SG rule already exists                    | Skip; not a real error (idempotency issue in template)       |
+| `RulesPerSecurityGroupLimitExceeded`                 | More than 60 inbound or 60 outbound rules | Consolidate rules; use CIDR ranges instead of individual IPs |
+
+### AWS::FSx::FileSystem
+
+| Status Reason                                   | Root Cause                              | Fix                                                  |
+| ----------------------------------------------- | --------------------------------------- | ---------------------------------------------------- |
+| `The subnet is not in a supported AZ`           | FSx Lustre not available in subnet's AZ | Use a subnet in an AZ that supports FSx Lustre       |
+| `The security group does not belong to the VPC` | SG and subnet in different VPCs         | Move SG or subnet to same VPC                        |
+| `Insufficient storage capacity`                 | FSx Lustre capacity exhausted in AZ     | Try different AZ or reduce storage size              |
+| `Invalid deployment type for storage type`      | Template uses incompatible FSx config   | PERSISTENT_2 requires SSD; check template parameters |
+
+### AWS::Lambda::Function (Custom Resources)
+
+| Status Reason                                    | Root Cause                           | Fix                                                       |
+| ------------------------------------------------ | ------------------------------------ | --------------------------------------------------------- |
+| `<error message from Lambda>` (Custom::Resource) | Lambda-backed custom resource failed | Find the Lambda function name → check its CloudWatch logs |
+| `Timed out`                                      | Lambda exceeded 15-minute limit      | Custom resource handler is too slow; check what it does   |
+
+**To debug Custom::Resource failures:**
+
+```bash
+# Find Lambda function name from CFN Resources tab, then:
+aws logs tail /aws/lambda/<FUNCTION_NAME> --region <REGION> --since 1h
+```
+
+---
+
+## Rolled-Back Stacks
+
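+When a stack rolls back, CloudFormation deletes the resources it created, including nested stacks. The top-level stack remains listed as `ROLLBACK_COMPLETE`, but the deleted nested stacks, which hold the root-cause events, no longer appear in the default view. To view them: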
+
+1. CloudFormation Console → **Deleted** filter (top-right dropdown)
+2. Or via CLI:
+
+   ```bash
+   aws cloudformation list-stacks \
+     --stack-status-filter ROLLBACK_COMPLETE DELETE_COMPLETE \
+     --region <REGION> \
+     --query 'StackSummaries[?contains(StackName,`HyperPod`) || contains(StackName,`hyperpod`)].{Name:StackName,Status:StackStatus,Time:CreationTime}' \
+     --output table
+   ```
+
+---
+
+## CFN Template Gotchas
+
+### ThreadsPerCore
+
+`ThreadsPerCore` defaults to 1 (hyperthreading disabled) when set via console "Advanced Configuration." This makes p5.48xlarge show 96 vCPU instead of 192. Fix: set `ThreadsPerCore: 2` explicitly.
+
+Any `UpdateCluster` call via CFN **must include ThreadsPerCore** even if not originally set — omitting it resets to default.
+
+### S3 Bucket Naming
+
+The `SourceS3Uri` must match pattern `s3://sagemaker-*` per API validation. CFN templates typically create a bucket named `sagemaker-lifecycle-<guid>`.
+
+### Condition-Dependent Resources
+
+If using the reference HyperPod CFN template, some resources are conditional:
+
+- FSx OpenZFS: only created if `CreateOpenZFS=true`
+- S3 VPC Endpoint: only created if `CreateS3Endpoint=true`
+- SSM Session Document: only if `CreateSSMSessionDocument=true`
+
+A condition evaluating to `false` means the resource is skipped (not failed).
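
The nested-stack navigation steps above boil down to: skip the `AWS::CloudFormation::Stack` wrapper events and read the earliest failed leaf resource. A hedged sketch of that filter, assuming the failure events have already been exported as tab-separated rows (timestamp, resource type, logical ID, reason). The helper name `earliest_leaf_failure` and the sample rows are illustrative, not real output.

```bash
#!/bin/sh
# earliest_leaf_failure (hypothetical helper): read tab-separated failure rows
# on stdin and print the earliest one that is not a nested-stack wrapper.
# Assumes column 1 is an ISO 8601 timestamp in a single time zone, so a plain
# byte-wise sort is also a chronological sort.
earliest_leaf_failure() {
  awk -F '\t' '$2 != "AWS::CloudFormation::Stack"' | LC_ALL=C sort | head -n 1
}

# Rows could be collected from each stack in the hierarchy, for example:
#   aws cloudformation describe-stack-events --stack-name <STACK_ID> --region <REGION> \
#     --output text \
#     --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[Timestamp,ResourceType,LogicalResourceId,ResourceStatusReason]'

# Made-up sample rows, for illustration only:
printf '%s\t%s\t%s\t%s\n' \
  '2025-01-01T10:05:00Z' 'AWS::CloudFormation::Stack' 'ClusterStack' 'Embedded stack was not successfully created' \
  '2025-01-01T10:04:30Z' 'AWS::FSx::FileSystem' 'FileSystem' 'Resource creation cancelled' \
  '2025-01-01T10:04:00Z' 'AWS::SageMaker::Cluster' 'Cluster' 'Insufficient capacity in the Availability Zone' \
  | earliest_leaf_failure
```

For the sample rows, the function prints the `AWS::SageMaker::Cluster` row: the wrapper event is dropped and the later "cancelled" event sorts after the real failure.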