2 changes: 1 addition & 1 deletion plugins/sagemaker-ai/.claude-plugin/plugin.json
```diff
@@ -18,5 +18,5 @@
   "license": "Apache-2.0",
   "name": "sagemaker-ai",
   "repository": "https://github.com/awslabs/agent-plugins",
-  "version": "1.1.0"
+  "version": "1.1.1"
 }
```
267 changes: 267 additions & 0 deletions plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md

Large diffs are not rendered by default.

@@ -0,0 +1,238 @@
# Capacity Planning for HyperPod Clusters

Deep-dive companion to the main [SKILL.md](../SKILL.md) § B (Capacity & AZ) and the `--validate` pre-create mode. Capacity errors are one of the most common cluster-creation failures. This reference covers how to choose the right capacity strategy, verify availability, and resolve capacity-related failures.

---

## Capacity Acquisition Options

### 1. On-Demand Instances

**Best for:** Small instance types, short-term experiments, development clusters.

- No upfront commitment
- Available immediately for common types (g5, p3)
- **Not guaranteed** for large GPU types (p4d, p5, p5e, trn1, trn2)
- Instances may not be allocated in physical proximity → suboptimal network topology for distributed training
- Higher hourly cost

```bash
# Check where an instance type is available:
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters "Name=instance-type,Values=ml.p5.48xlarge" \
  --region us-west-2 \
  --query 'InstanceTypeOfferings[*].Location' --output table
```

### 2. Flexible Training Plans

**Best for:** Medium to large workloads with predictable schedules.

Query purchasable capacity by instance type, count, and desired schedule with `aws sagemaker search-training-plan-offerings`; AWS returns the available offerings with pricing.

```bash
# List active training plans:
aws sagemaker list-training-plans \
  --filters Name=Status,Value=Active \
  --region <REGION> \
  --query 'TrainingPlanSummaries[*].{Name:TrainingPlanName,Type:InstanceType,Count:TotalInstanceCount,AZ:AvailabilityZone,Status:Status,Start:StartTime,End:EndTime}' \
  --output table
```

**Using with HyperPod:**

```bash
aws sagemaker create-cluster \
  --cluster-name my-cluster \
  --instance-groups '[{
    "InstanceGroupName": "gpu-workers",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 4,
    "ExecutionRole": "arn:aws:iam::<ACCT>:role/HyperPodRole",
    "TrainingPlanArn": "arn:aws:sagemaker:<REGION>:<ACCT>:training-plan/<PLAN_NAME>",
    "LifeCycleConfig": {
      "SourceS3Uri": "s3://sagemaker-lifecycle-<guid>/",
      "OnCreate": "on_create.sh"
    }
  }]' \
  --vpc-config '{"SecurityGroupIds":["sg-xxx"],"Subnets":["subnet-xxx"]}' \
  --region <REGION>
```

**Critical:** The subnet must be in the **same AZ** as the training plan's `AvailabilityZone`.

**Training Plan Status Values:** `Pending`, `Active`, `Scheduled`, `Expired`, `Failed`

**Advantages:**

- Guaranteed capacity for reserved period
- Discounted pricing vs on-demand
- Better network topology (co-located instances)

**Disadvantages:**

- Requires advance planning and commitment
- Capacity locked to specific AZ

### 3. Reserved Capacity (ODCR via AWS Account Team)

**Best for:** Large-scale, long-term capacity needs (months+).

- Contact your AWS account team or TAM
- Best pricing for sustained usage
- Guaranteed placement in specific AZ
- Requires longer lead time

**Verification:**

```bash
# Check reserved capacity details:
aws sagemaker list-training-plans \
  --region <REGION> \
  --query 'TrainingPlanSummaries[?ReservedCapacitySummaries]'
```

**ReservedCapacitySummary fields:**

- `ReservedCapacityArn`, `ReservedCapacityType` (UltraServer or Instance)
- `InstanceType`, `TotalInstanceCount`, `AvailabilityZone`
- `DurationHours`, `DurationMinutes`, `StartTime`, `EndTime`, `Status`

---

## AZ Selection Strategy

### The Problem

Instance type availability varies by AZ. A subnet in `us-west-2a` may have capacity, while `us-west-2c` does not. Worse, AZ names (e.g., `us-west-2a`) map to different physical zones per AWS account.

### Use AZ IDs for Consistency

AZ IDs (e.g., `usw2-az1`) are consistent across accounts:

```bash
# Map AZ names to IDs:
aws ec2 describe-availability-zones --region <REGION> \
  --query 'AvailabilityZones[*].{Name:ZoneName,ID:ZoneId,State:State}' --output table
```

When coordinating with AWS Support or account teams about reserved capacity, always use **AZ IDs** (not names).

### Verify Subnet Matches Capacity AZ

```bash
# Your subnet's AZ:
aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \
  --query 'Subnets[0].{AZ:AvailabilityZone,AZ_ID:AvailabilityZoneId}'

# Instance type availability per AZ:
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone-id \
  --filters "Name=instance-type,Values=<TYPE>" \
  --region <REGION> \
  --query 'InstanceTypeOfferings[*].Location'
```

If your subnet's AZ doesn't appear in the instance type offerings list, create a new subnet in an AZ that does.
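This cross-check can be sketched in pure Python over the outputs of the two commands above (the helper name and sample IDs are hypothetical):

```python
def split_capacity_azs(offering_az_ids, subnets_by_az_id):
    """Partition AZ IDs that offer the instance type into those already
    covered by one of your subnets and those where a subnet must be created."""
    covered = {az: subnets_by_az_id[az] for az in offering_az_ids if az in subnets_by_az_id}
    missing = [az for az in offering_az_ids if az not in subnets_by_az_id]
    return covered, missing

# Hypothetical data mirroring the CLI outputs:
offerings = ["usw2-az1", "usw2-az3"]       # AZ IDs offering the instance type
subnets = {"usw2-az2": "subnet-0abc"}      # your existing subnets, keyed by AZ ID
covered, missing = split_capacity_azs(offerings, subnets)
print(covered, missing)  # {} ['usw2-az1', 'usw2-az3'] -> create a subnet in one of these
```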

---

## Subnet IP Capacity

GPU instances consume many network interfaces (and IPs) per instance:

| Instance Type    | ENIs | IPs per ENI              | Total IPs (Slurm) | Total IPs (EKS) |
| ---------------- | ---- | ------------------------ | ----------------- | --------------- |
| ml.p5.48xlarge   | 32   | 1 primary + 49 secondary | ~32               | ~81             |
| ml.p5e.48xlarge  | 32   | 1 primary + 49 secondary | ~32               | ~81             |
| ml.p4d.24xlarge  | 4    | 1 primary + 49 secondary | ~4                | ~51             |
| ml.p4de.24xlarge | 4    | 1 primary + 49 secondary | ~4                | ~51             |
| ml.trn1.32xlarge | 8    | 1 primary + 49 secondary | ~8                | ~57             |
| ml.trn2.48xlarge | 16   | 1 primary + 49 secondary | ~16               | ~65             |
| ml.g5.48xlarge   | 2    | 1 primary + 14 secondary | ~2                | ~15             |

### Calculate Required IPs

```
Required IPs = Instance Count × IPs per Instance
```

For example: 16 × ml.p5.48xlarge on EKS = 16 × 81 = 1,296 IPs → requires at least a /21 subnet (2,048 IPs).
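The sizing rule can be sketched as a small helper. The per-instance IP counts come from the table above; the helper itself is illustrative, not an AWS API:

```python
# Approximate IPs consumed per instance (from the table above; illustrative).
IPS_PER_INSTANCE = {
    ("ml.p5.48xlarge", "slurm"): 32,
    ("ml.p5.48xlarge", "eks"): 81,
}

def min_subnet_prefix(instance_type: str, orchestrator: str, count: int) -> int:
    """Smallest subnet (largest /prefix) whose address space covers the cluster.

    AWS also reserves 5 addresses in every subnet, so size with headroom
    beyond this minimum.
    """
    required = count * IPS_PER_INSTANCE[(instance_type, orchestrator)]
    for prefix in range(28, 15, -1):  # try /28 down to /16
        if 2 ** (32 - prefix) >= required:
            return prefix
    raise ValueError("cluster needs more than a /16 in this sketch")

print(min_subnet_prefix("ml.p5.48xlarge", "eks", 16))  # 16 * 81 = 1,296 IPs -> 21
```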

### Recommended Subnet Sizes

| Cluster Size (p5) | Orchestrator | Min Subnet CIDR |
| ----------------- | ------------ | ---------------------------- |
| 4 instances | Slurm | /25 (128 IPs) |
| 4 instances       | EKS          | /23 (512 IPs)                |
| 16 instances | Slurm | /23 (512 IPs) |
| 16 instances | EKS | /21 (2,048 IPs) |
| 64 instances | Slurm | /21 (2,048 IPs) |
| 64 instances | EKS | /19 (8,192 IPs) |

**Subnet CIDRs cannot be changed after creation.** Plan for growth.

```bash
# Check current availability:
aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \
  --query 'Subnets[0].{CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}'
```

---

## Service Quotas

Check these **before** creating a cluster:

```bash
# List SageMaker quotas (search for "cluster"):
aws service-quotas list-service-quotas \
  --service-code sagemaker --region <REGION> \
  --query 'Quotas[?contains(QuotaName,`cluster`) || contains(QuotaName,`Cluster`)].{Name:QuotaName,Value:Value,Code:QuotaCode}' \
  --output table
```

| Quota | Default | What Happens If Exceeded |
| -------------------------------- | ---------------- | --------------------------------------- |
| `ml.<type> for cluster usage` | Varies | `CreateCluster` fails with quota error |
| Max instances per cluster | Account-specific | Cannot add more instance groups |
| Total instances across clusters | Account-specific | Must delete existing clusters first |
| Max EBS volume size per instance | 16,384 GB | `CreateCluster` fails if config exceeds |
| VPCs per region | 5 | CFN VPC creation fails |
| Network interfaces per region | 5,000 | Instance provisioning fails silently |
| Elastic IPs per region | 5 | NAT Gateway creation fails |

**Request quota increases proactively** — increases can take 1-3 business days.

---

## Troubleshooting Capacity Failures

### "Insufficient capacity" Error

1. Check which AZs have the instance type available (see commands above)
2. Verify your subnet is in one of those AZs
3. If no AZ has capacity: try a different region, instance type, or contact account team
4. If using Training Plan: verify `TrainingPlanArn` and subnet AZ match

### "No subnets in the capacity AZ" Error

The cluster configuration specifies subnets, but none of them are in the AZ where AWS has capacity.

Fix: Create a new subnet in the AZ where capacity exists and add it to the cluster configuration.

### Cluster Stuck in "Creating" (No Progress)

1. Check `list-cluster-events` for error messages
2. If no events: likely waiting for capacity
3. If events show failures: fix the indicated issue
4. If stuck >1 hour with no events: contact AWS Support

### Partial Provisioning (Some Nodes Running, Others Failing)

This typically means capacity was available for some instances but not all.

- The cluster will keep retrying if `NodeProvisioningMode=Continuous`
- Check events for the specific instance group that's failing
- Consider reducing `InstanceCount` or using `MinInstanceCount` for elastic scaling
@@ -0,0 +1,148 @@
# CloudFormation Error Reference for HyperPod Deployments

Deep-dive companion to the main [SKILL.md](../SKILL.md) § H (CloudFormation Errors). When deploying HyperPod via the SageMaker console or CloudFormation templates, failures surface as `CREATE_FAILED` or `ROLLBACK_COMPLETE` at the top-level stack. The actual root cause is usually buried several levels deep in nested stacks.

---

## Navigating Nested Stacks

### Stack Hierarchy (Console Deployments)

Typical HyperPod console deployment creates this stack structure:

```
Top-Level Stack (HyperPod-<name>)
├── NetworkStack (VPC, subnets, IGW, NAT, SG, S3 endpoint)
├── StorageStack (FSx Lustre, optional OpenZFS)
├── IAMStack (execution role, instance profile)
├── S3Stack (lifecycle scripts bucket + upload)
└── ClusterStack (AWS::SageMaker::Cluster resource)
└── [The cluster resource itself — most failures end here]
```

### Step-by-Step Navigation

1. **CloudFormation Console** → ensure correct region → find the HyperPod stack
2. **Status filter:** look for `CREATE_FAILED` or `ROLLBACK_COMPLETE`
3. **Events tab** → filter by `CREATE_FAILED` → note the earliest failure timestamp
4. **Resources tab** → find `AWS::CloudFormation::Stack` type entries with `CREATE_FAILED`
5. **Click Physical ID** of the failed nested stack
6. **Repeat** until reaching a stack with only leaf resources (no further `AWS::CloudFormation::Stack`)
7. **Read Status Reason** on the failed leaf resource — this is the root cause
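The drill-down amounts to filtering `describe-stack-events` output down to `CREATE_FAILED` entries on leaf resources; nested-stack wrappers only say that an embedded stack failed. A sketch over hypothetical event data (boto3 would supply the real events):

```python
def root_cause_events(events):
    """Keep CREATE_FAILED events on leaf resources. Nested-stack wrappers
    (AWS::CloudFormation::Stack) only point at a child stack, so skip them."""
    return [
        e for e in events
        if e["ResourceStatus"] == "CREATE_FAILED"
        and e["ResourceType"] != "AWS::CloudFormation::Stack"
    ]

# Hypothetical events from a failed HyperPod deployment:
events = [
    {"ResourceStatus": "CREATE_FAILED", "ResourceType": "AWS::CloudFormation::Stack",
     "LogicalResourceId": "ClusterStack",
     "ResourceStatusReason": "Embedded stack was not successfully created"},
    {"ResourceStatus": "CREATE_FAILED", "ResourceType": "AWS::SageMaker::Cluster",
     "LogicalResourceId": "HyperPodCluster",
     "ResourceStatusReason": "Insufficient capacity in the Availability Zone"},
]
for e in root_cause_events(events):
    print(e["LogicalResourceId"], "->", e["ResourceStatusReason"])
```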

### Tip: Find Root Cause via CLI

```bash
# List all failed events across all stacks (requires stack name or ID):
aws cloudformation describe-stack-events \
  --stack-name <TOP_LEVEL_STACK_NAME> \
  --region <REGION> \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].{Time:Timestamp,Resource:LogicalResourceId,Type:ResourceType,Reason:ResourceStatusReason}' \
  --output table

# For nested stacks, get the nested stack's ID from the parent's Resources tab:
aws cloudformation describe-stack-events \
  --stack-name <NESTED_STACK_ID> \
  --region <REGION> \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'
```

---

## Resource Error Catalog

### AWS::SageMaker::Cluster

| Status Reason | Root Cause | Fix |
| ----------------------------------------------------------- | -------------------------------------------------- | ---------------------------------------------------------------------------- |
| `Insufficient capacity in the Availability Zone` | No on-demand instances available in AZ | Add subnet in different AZ; use Flexible Training Plans or reserved capacity |
| `No subnets in the capacity AZ` | Cluster subnet not in the AZ where capacity exists | Create subnet in the AZ where instances are available |
| `EFA health checks did not run successfully` | Security group missing self-referencing rules | Add inbound + outbound self-ref rules on SG (protocol: All, source: self) |
| `Lifecycle scripts did not run successfully` | Script error, S3 access, or timeout | Check CloudWatch logs: `/aws/sagemaker/Clusters/<name>/<id>` |
| `Instance bootstrap failed due to network misconfiguration` | VPC routing or SG issue | Verify NAT Gateway route, S3 VPC endpoint, SG rules |
| `The security group 'sg-xxx' does not exist` | SG ID is wrong or in different region | Verify SG exists in the same region and VPC |
| `The subnet 'subnet-xxx' does not exist` | Subnet ID is wrong or in different region | Verify subnet exists in the same region |
| `You are not authorized to perform this operation` | Execution role missing permissions | Add `AmazonSageMakerClusterInstanceRolePolicy` + VPC permissions |
| `The maximum number of instances ... has been reached` | Service quota exceeded | Request quota increase via Service Quotas console |

### AWS::IAM::Role

| Status Reason | Root Cause | Fix |
| ----------------------------------------- | ------------------------------------- | ---------------------------------------------------------- |
| `Cannot exceed quota for PoliciesPerRole` | Too many managed policies attached | Consolidate inline policies; limit is 10 managed per role |
| `Invalid principal in policy` | Trust policy references wrong service | Use `"Service": "sagemaker.amazonaws.com"` in trust policy |
| `MalformedPolicyDocument` | JSON syntax error in inline policy | Validate JSON; check for trailing commas, missing quotes |
| `EntityAlreadyExists` | Role name already taken | Use unique name or import existing role |

### AWS::EC2::VPC / Subnet / SecurityGroup

| Status Reason | Root Cause | Fix |
| ---------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------------ |
| `The CIDR 'x.x.x.x/y' conflicts with another subnet` | Overlapping CIDR in same VPC | Use non-overlapping CIDR blocks |
| `The maximum number of VPCs has been reached` | VPC quota per region (default: 5) | Request VPC quota increase |
| `InvalidGroup.Duplicate` | SG rule already exists | Skip — not a real error (idempotency issue in template) |
| `RulesPerSecurityGroupLimitExceeded` | More than 60 inbound or 60 outbound rules | Consolidate rules; use CIDR ranges instead of individual IPs |

### AWS::FSx::FileSystem

| Status Reason | Root Cause | Fix |
| ----------------------------------------------- | --------------------------------------- | ---------------------------------------------------- |
| `The subnet is not in a supported AZ` | FSx Lustre not available in subnet's AZ | Use a subnet in an AZ that supports FSx Lustre |
| `The security group does not belong to the VPC` | SG and subnet in different VPCs | Move SG or subnet to same VPC |
| `Insufficient storage capacity` | FSx Lustre capacity exhausted in AZ | Try different AZ or reduce storage size |
| `Invalid deployment type for storage type` | Template uses incompatible FSx config | PERSISTENT_2 requires SSD; check template parameters |

### AWS::Lambda::Function (Custom Resources)

| Status Reason | Root Cause | Fix |
| ------------------------------------------------ | ------------------------------------ | --------------------------------------------------------- |
| `<error message from Lambda>` (Custom::Resource) | Lambda-backed custom resource failed | Find the Lambda function name → check its CloudWatch logs |
| `Timed out` | Lambda exceeded 15-minute limit | Custom resource handler is too slow; check what it does |

**To debug Custom::Resource failures:**

```bash
# Find Lambda function name from CFN Resources tab, then:
aws logs tail /aws/lambda/<FUNCTION_NAME> --region <REGION> --since 1h
```

---

## Rolled-Back Stacks

When a stack rolls back, CloudFormation deletes the resources it created. To view rolled-back stacks:

1. CloudFormation Console → **Deleted** filter (top-right dropdown)
2. Or via CLI:

```bash
aws cloudformation list-stacks \
  --stack-status-filter ROLLBACK_COMPLETE DELETE_COMPLETE \
  --region <REGION> \
  --query 'StackSummaries[?contains(StackName,`HyperPod`) || contains(StackName,`hyperpod`)].{Name:StackName,Status:StackStatus,Time:CreationTime}' \
  --output table
```

---

## CFN Template Gotchas

### ThreadsPerCore

`ThreadsPerCore` defaults to 1 (hyperthreading disabled) when set via console "Advanced Configuration." This makes p5.48xlarge show 96 vCPU instead of 192. Fix: set `ThreadsPerCore: 2` explicitly.

Any `UpdateCluster` call via CFN **must include ThreadsPerCore** even if not originally set — omitting it resets to default.
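A minimal instance-group fragment with the value pinned explicitly, as a sketch; the field names mirror the `create-cluster` example earlier, and the ARN and bucket name are placeholders:

```json
{
  "InstanceGroupName": "gpu-workers",
  "InstanceType": "ml.p5.48xlarge",
  "InstanceCount": 4,
  "ThreadsPerCore": 2,
  "ExecutionRole": "arn:aws:iam::<ACCT>:role/HyperPodRole",
  "LifeCycleConfig": {
    "SourceS3Uri": "s3://sagemaker-lifecycle-<guid>/",
    "OnCreate": "on_create.sh"
  }
}
```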

### S3 Bucket Naming

The `SourceS3Uri` must match pattern `s3://sagemaker-*` per API validation. CFN templates typically create a bucket named `sagemaker-lifecycle-<guid>`.
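A quick pre-flight check for this constraint might look like the following (a sketch of the documented `s3://sagemaker-*` rule; the actual API validation may be stricter):

```python
import re

# Documented prefix constraint on SourceS3Uri (sketch only).
LIFECYCLE_URI = re.compile(r"^s3://sagemaker-")

def valid_source_s3_uri(uri: str) -> bool:
    """True if the lifecycle-script URI satisfies the s3://sagemaker-* pattern."""
    return bool(LIFECYCLE_URI.match(uri))

print(valid_source_s3_uri("s3://sagemaker-lifecycle-abc/"))  # True
print(valid_source_s3_uri("s3://my-lifecycle-bucket/"))      # False
```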

### Condition-Dependent Resources

If using the reference HyperPod CFN template, some resources are conditional:

- FSx OpenZFS: only created if `CreateOpenZFS=true`
- S3 VPC Endpoint: only created if `CreateS3Endpoint=true`
- SSM Session Document: only if `CreateSSMSessionDocument=true`

A condition evaluating to `false` means the resource is skipped (not failed).