
Commit 9b1c98e

feat(sagemaker-ai): add HyperPod debugger skills and fix existing skill issues
Add six new diagnostic skills for SageMaker HyperPod:

- hyperpod-cluster-debugger: cluster-wide lifecycle issues (EKS/Slurm)
- hyperpod-mfu-debugger: MFU degradation triage
- hyperpod-nccl: NCCL failure diagnosis
- hyperpod-node-debugger: per-node health issues
- hyperpod-performance-debugger: NCCL bandwidth, filesystem, GPU failures
- hyperpod-slurm-debugger: Slurm node-state management

Also fixes:

- hyperpod-issue-report: simplify SSM troubleshooting message
- hyperpod-version-checker: remove unused RED color var, replace IS_GPU flag with runtime nvidia-smi detection

Bumps sagemaker-ai plugin version to 1.1.1.
1 parent 5305a2b commit 9b1c98e

43 files changed

Lines changed: 17141 additions & 6 deletions


plugins/sagemaker-ai/.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -18,5 +18,5 @@
   "license": "Apache-2.0",
   "name": "sagemaker-ai",
   "repository": "https://github.com/awslabs/agent-plugins",
-  "version": "1.1.0"
+  "version": "1.1.1"
 }

plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md

Lines changed: 267 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 238 additions & 0 deletions
@@ -0,0 +1,238 @@
1+
# Capacity Planning for HyperPod Clusters
2+
3+
Deep-dive companion to the main [SKILL.md](../SKILL.md) § B (Capacity & AZ) and the `--validate` pre-create mode. Capacity errors are one of the most common cluster-creation failures. This reference covers how to choose the right capacity strategy, verify availability, and resolve capacity-related failures.
4+
5+
---
6+
7+
## Capacity Acquisition Options
8+
9+
### 1. On-Demand Instances
10+
11+
**Best for:** Small instance types, short-term experiments, development clusters.
12+
13+
- No upfront commitment
14+
- Available immediately for common types (g5, p3)
15+
- **Not guaranteed** for large GPU types (p4d, p5, p5e, trn1, trn2)
16+
- Instances may not be allocated in physical proximity → suboptimal network topology for distributed training
17+
- Higher hourly cost
18+
19+
```bash
20+
# Check where an instance type is offered (the EC2 API uses the name without the "ml." prefix):
21+
aws ec2 describe-instance-type-offerings \
22+
--location-type availability-zone \
23+
--filters "Name=instance-type,Values=p5.48xlarge" \
24+
--region us-west-2 \
25+
--query 'InstanceTypeOfferings[*].Location' --output table
26+
```
27+
28+
### 2. Flexible Training Plans
29+
30+
**Best for:** Medium to large workloads with predictable schedules.
31+
32+
Query available capacity by instance type, count, and desired schedule. AWS returns available options with pricing.
33+
34+
```bash
35+
# List active training plans:
36+
aws sagemaker list-training-plans \
37+
--filters Name=Status,Value=Active \
38+
--region <REGION> \
39+
--query 'TrainingPlanSummaries[*].{Name:TrainingPlanName,Type:ReservedCapacitySummaries[0].InstanceType,Count:TotalInstanceCount,AZ:ReservedCapacitySummaries[0].AvailabilityZone,Status:Status,Start:StartTime,End:EndTime}' \
40+
--output table
41+
```
42+
43+
**Using with HyperPod:**
44+
45+
```bash
46+
aws sagemaker create-cluster \
47+
--cluster-name my-cluster \
48+
--instance-groups '[{
49+
"InstanceGroupName": "gpu-workers",
50+
"InstanceType": "ml.p5.48xlarge",
51+
"InstanceCount": 4,
52+
"ExecutionRole": "arn:aws:iam::<ACCT>:role/HyperPodRole",
53+
"TrainingPlanArn": "arn:aws:sagemaker:<REGION>:<ACCT>:training-plan/<PLAN_NAME>",
54+
"LifeCycleConfig": {
55+
"SourceS3Uri": "s3://sagemaker-lifecycle-<guid>/",
56+
"OnCreate": "on_create.sh"
57+
}
58+
}]' \
59+
--vpc-config '{"SecurityGroupIds":["sg-xxx"],"Subnets":["subnet-xxx"]}' \
60+
--region <REGION>
61+
```
62+
63+
**Critical:** The subnet must be in the **same AZ** as the training plan's `AvailabilityZone`.
64+
65+
**Training Plan Status Values:** `Pending`, `Active`, `Scheduled`, `Expired`, `Failed`
66+
67+
**Advantages:**
68+
69+
- Guaranteed capacity for reserved period
70+
- Discounted pricing vs on-demand
71+
- Better network topology (co-located instances)
72+
73+
**Disadvantages:**
74+
75+
- Requires advance planning and commitment
76+
- Capacity locked to specific AZ
77+
78+
### 3. Reserved Capacity (ODCR via AWS Account Team)
79+
80+
**Best for:** Large-scale, long-term capacity needs (months+).
81+
82+
- Contact your AWS account team or TAM
83+
- Best pricing for sustained usage
84+
- Guaranteed placement in specific AZ
85+
- Requires longer lead time
86+
87+
**Verification:**
88+
89+
```bash
90+
# Check reserved capacity details:
91+
aws sagemaker list-training-plans \
92+
--region <REGION> \
93+
--query 'TrainingPlanSummaries[?ReservedCapacitySummaries]'
94+
```
95+
96+
**ReservedCapacitySummary fields:**
97+
98+
- `ReservedCapacityArn`, `ReservedCapacityType` (UltraServer or Instance)
99+
- `InstanceType`, `TotalInstanceCount`, `AvailabilityZone`
100+
- `DurationHours`, `DurationMinutes`, `StartTime`, `EndTime`, `Status`
101+
102+
---
103+
104+
## AZ Selection Strategy
105+
106+
### The Problem
107+
108+
Instance type availability varies by AZ. A subnet in `us-west-2a` may have capacity, while `us-west-2c` does not. Worse, AZ names (e.g., `us-west-2a`) map to different physical zones per AWS account.
109+
110+
### Use AZ IDs for Consistency
111+
112+
AZ IDs (e.g., `usw2-az1`) are consistent across accounts:
113+
114+
```bash
115+
# Map AZ names to IDs:
116+
aws ec2 describe-availability-zones --region <REGION> \
117+
--query 'AvailabilityZones[*].{Name:ZoneName,ID:ZoneId,State:State}' --output table
118+
```
119+
120+
When coordinating with AWS Support or account teams about reserved capacity, always use **AZ IDs** (not names).
121+
122+
### Verify Subnet Matches Capacity AZ
123+
124+
```bash
125+
# Your subnet's AZ:
126+
aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \
127+
--query 'Subnets[0].{AZ:AvailabilityZone,AZ_ID:AvailabilityZoneId}'
128+
129+
# Instance type offerings per AZ ID (use the EC2 name, e.g. p5.48xlarge, without the "ml." prefix):
130+
aws ec2 describe-instance-type-offerings \
131+
--location-type availability-zone-id \
132+
--filters "Name=instance-type,Values=<TYPE>" \
133+
--region <REGION> \
134+
--query 'InstanceTypeOfferings[*].Location'
135+
```
136+
137+
If your subnet's AZ ID doesn't appear in the instance type offerings list, create a new subnet in an AZ that does.
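
The comparison between the two command outputs can be sketched as a small shell helper. This is an illustrative sketch only: the function name `az_id_offered` is hypothetical, and the commented usage assumes configured AWS credentials and the same placeholders as the commands above.

```bash
#!/usr/bin/env bash
# az_id_offered SUBNET_AZ_ID OFFERED_AZ_ID...
# Hypothetical helper: succeeds if the subnet's AZ ID is among the AZ IDs
# that offer the instance type.
az_id_offered() {
  local subnet_az_id=$1
  shift
  local offered
  for offered in "$@"; do
    if [[ "$offered" == "$subnet_az_id" ]]; then
      return 0
    fi
  done
  return 1
}

# Example with literal values (no AWS calls):
if az_id_offered usw2-az1 usw2-az1 usw2-az2; then
  echo "subnet AZ offers the instance type"
fi

# Usage with live data (assumes credentials; placeholders as above):
#   SUBNET_AZ_ID=$(aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \
#     --query 'Subnets[0].AvailabilityZoneId' --output text)
#   OFFERED=$(aws ec2 describe-instance-type-offerings \
#     --location-type availability-zone-id \
#     --filters "Name=instance-type,Values=<TYPE>" --region <REGION> \
#     --query 'InstanceTypeOfferings[*].Location' --output text)
#   az_id_offered "$SUBNET_AZ_ID" $OFFERED || echo "create a subnet in an offered AZ"
```

Note that an offering only means the type is sold in that AZ; it does not guarantee that capacity is free at launch time.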
138+
139+
---
140+
141+
## Subnet IP Capacity
142+
143+
GPU instances consume many network interfaces (and IPs) per instance:
144+
145+
| Instance Type | ENIs | IPs per ENI | Total IPs (Slurm) | Total IPs (EKS) |
146+
| ---------------- | ---- | ------------------------ | ----------------- | --------------- |
147+
| ml.p5.48xlarge | 32 | 1 primary + 49 secondary | ~32 | ~81 |
148+
| ml.p5e.48xlarge | 32 | same | ~32 | ~81 |
149+
| ml.p4d.24xlarge | 4 | 1 primary + 49 secondary | ~4 | ~51 |
150+
| ml.p4de.24xlarge | 4 | same | ~4 | ~51 |
151+
| ml.trn1.32xlarge | 8 | 1 primary + 49 secondary | ~8 | ~57 |
152+
| ml.trn2.48xlarge | 16 | same | ~16 | ~65 |
153+
| ml.g5.48xlarge | 2 | 1 primary + 14 secondary | ~2 | ~15 |
154+
155+
### Calculate Required IPs
156+
157+
```
158+
Required IPs = Instance Count × IPs per Instance
159+
```
160+
161+
For example: 16 × ml.p5.48xlarge on EKS = 16 × 81 = 1,296 IPs → requires at least a /21 subnet (2,048 IPs).
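
The sizing arithmetic can be sketched as a shell helper. This is an illustrative sketch, not part of the skill: the function name `min_subnet_prefix` is hypothetical, the per-instance IP count is taken from the table above, and the helper assumes the 5 addresses that AWS reserves in every subnet.

```bash
#!/usr/bin/env bash
# min_subnet_prefix INSTANCE_COUNT IPS_PER_INSTANCE
# Hypothetical helper: prints the smallest subnet (longest prefix) whose
# usable address count covers the requirement. AWS subnets range from /28
# to /16, and 5 addresses per subnet are reserved by AWS.
min_subnet_prefix() {
  local count=$1 per_instance=$2
  local required=$(( count * per_instance ))
  local prefix usable
  for (( prefix = 28; prefix >= 16; prefix-- )); do
    usable=$(( (1 << (32 - prefix)) - 5 ))
    if (( usable >= required )); then
      echo "/${prefix}"
      return 0
    fi
  done
  echo "requirement exceeds a /16 subnet" >&2
  return 1
}

min_subnet_prefix 16 81   # 16 x ml.p5.48xlarge on EKS; prints /21
```

The result covers only the cluster's own instances; any other hosts sharing the subnet need additional headroom.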
162+
163+
### Recommended Subnet Sizes
164+
165+
| Cluster Size (p5) | Orchestrator | Min Subnet CIDR |
166+
| ----------------- | ------------ | ---------------------------- |
167+
| 4 instances | Slurm | /24 (256 IPs) |
168+
| 4 instances | EKS | /23 (512 IPs) |
169+
| 16 instances | Slurm | /22 (1,024 IPs) |
170+
| 16 instances | EKS | /21 (2,048 IPs) |
171+
| 64 instances | Slurm | /20 (4,096 IPs) |
172+
| 64 instances | EKS | /19 (8,192 IPs) |
173+
174+
**Subnet CIDRs cannot be changed after creation.** Plan for growth. The sizes above apply the per-instance IP counts from the previous table (32 for Slurm, 81 for EKS) and leave room for the 5 addresses AWS reserves in every subnet.
175+
176+
```bash
177+
# Check current availability:
178+
aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \
179+
--query 'Subnets[0].{CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}'
180+
```
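
To turn the free-IP count into a go/no-go check, a sketch like the following may help. The function name `ips_shortfall` is hypothetical; it assumes you pass the `AvailableIpAddressCount` value from the command above together with the instance count and per-instance IP figure from the earlier table.

```bash
#!/usr/bin/env bash
# ips_shortfall FREE_IPS INSTANCE_COUNT IPS_PER_INSTANCE
# Hypothetical helper: prints how many more IPs the subnet would need
# (0 if the free addresses already cover the requirement).
ips_shortfall() {
  local free=$1 count=$2 per_instance=$3
  local required=$(( count * per_instance ))
  if (( free >= required )); then
    echo 0
  else
    echo $(( required - free ))
  fi
}

ips_shortfall 1000 16 81   # 16 x 81 = 1296 required; prints 296
```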
181+
182+
---
183+
184+
## Service Quotas
185+
186+
Check these **before** creating a cluster:
187+
188+
```bash
189+
# List SageMaker quotas (search for "cluster"):
190+
aws service-quotas list-service-quotas \
191+
--service-code sagemaker --region <REGION> \
192+
--query 'Quotas[?contains(QuotaName,`cluster`) || contains(QuotaName,`Cluster`)].{Name:QuotaName,Value:Value,Code:QuotaCode}' \
193+
--output table
194+
```
195+
196+
| Quota | Default | What Happens If Exceeded |
197+
| -------------------------------- | ---------------- | --------------------------------------- |
198+
| `ml.<type> for cluster usage` | Varies | `CreateCluster` fails with quota error |
199+
| Max instances per cluster | Account-specific | Cannot add more instance groups |
200+
| Total instances across clusters | Account-specific | Must delete existing clusters first |
201+
| Max EBS volume size per instance | 16,384 GB | `CreateCluster` fails if config exceeds |
202+
| VPCs per region | 5 | CFN VPC creation fails |
203+
| Network interfaces per region | 5,000 | Instance provisioning fails silently |
204+
| Elastic IPs per region | 5 | NAT Gateway creation fails |
205+
206+
**Request quota increases proactively** — increases can take 1-3 business days.
207+
208+
---
209+
210+
## Troubleshooting Capacity Failures
211+
212+
### "Insufficient capacity" Error
213+
214+
1. Check which AZs have the instance type available (see commands above)
215+
2. Verify your subnet is in one of those AZs
216+
3. If no AZ has capacity: try a different region, instance type, or contact account team
217+
4. If using Training Plan: verify `TrainingPlanArn` and subnet AZ match
218+
219+
### "No subnets in the capacity AZ" Error
220+
221+
The cluster configuration specifies subnets, but none of them are in the AZ where AWS has capacity.
222+
223+
Fix: Create a new subnet in the AZ where capacity exists and add it to the cluster configuration.
224+
225+
### Cluster Stuck in "Creating" (No Progress)
226+
227+
1. Check `list-cluster-events` for error messages
228+
2. If no events: likely waiting for capacity
229+
3. If events show failures: fix the indicated issue
230+
4. If stuck >1 hour with no events: contact AWS Support
231+
232+
### Partial Provisioning (Some Nodes Running, Others Failing)
233+
234+
This typically means capacity was available for some instances but not all.
235+
236+
- The cluster will keep retrying if `NodeProvisioningMode=Continuous`
237+
- Check events for the specific instance group that's failing
238+
- Consider reducing `InstanceCount` or using `MinInstanceCount` for elastic scaling
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
1+
# CloudFormation Error Reference for HyperPod Deployments
2+
3+
Deep-dive companion to the main [SKILL.md](../SKILL.md) § H (CloudFormation Errors). When deploying HyperPod via the SageMaker console or CloudFormation templates, failures surface as `CREATE_FAILED` or `ROLLBACK_COMPLETE` at the top-level stack. The actual root cause is usually buried several levels deep in nested stacks.
4+
5+
---
6+
7+
## Navigating Nested Stacks
8+
9+
### Stack Hierarchy (Console Deployments)
10+
11+
A typical HyperPod console deployment creates this stack structure:
12+
13+
```
14+
Top-Level Stack (HyperPod-<name>)
15+
├── NetworkStack (VPC, subnets, IGW, NAT, SG, S3 endpoint)
16+
├── StorageStack (FSx Lustre, optional OpenZFS)
17+
├── IAMStack (execution role, instance profile)
18+
├── S3Stack (lifecycle scripts bucket + upload)
19+
└── ClusterStack (AWS::SageMaker::Cluster resource)
20+
└── [The cluster resource itself — most failures end here]
21+
```
22+
23+
### Step-by-Step Navigation
24+
25+
1. **CloudFormation Console** → ensure correct region → find the HyperPod stack
26+
2. **Status filter:** look for `CREATE_FAILED` or `ROLLBACK_COMPLETE`
27+
3. **Events tab** → filter by `CREATE_FAILED` → note the earliest failure timestamp
28+
4. **Resources tab** → find `AWS::CloudFormation::Stack` type entries with `CREATE_FAILED`
29+
5. **Click Physical ID** of the failed nested stack
30+
6. **Repeat** until reaching a stack with only leaf resources (no further `AWS::CloudFormation::Stack`)
31+
7. **Read Status Reason** on the failed leaf resource — this is the root cause
32+
33+
### Tip: Find Root Cause via CLI
34+
35+
```bash
36+
# List failed events for one stack (requires stack name or ID; nested stacks must be queried separately):
37+
aws cloudformation describe-stack-events \
38+
--stack-name <TOP_LEVEL_STACK_NAME> \
39+
--region <REGION> \
40+
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].{Time:Timestamp,Resource:LogicalResourceId,Type:ResourceType,Reason:ResourceStatusReason}' \
41+
--output table
42+
43+
# For nested stacks — get the nested stack's name from Resources tab:
44+
aws cloudformation describe-stack-events \
45+
--stack-name <NESTED_STACK_ID> \
46+
--region <REGION> \
47+
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'
48+
```
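
The manual descent through nested stacks can also be scripted. The following is a sketch under stated assumptions, not part of the skill: the function name `find_failed_leaves` is hypothetical, it relies only on `aws cloudformation describe-stack-events`, and it assumes configured credentials and that the failed nested stacks' events are still retrievable by stack ID.

```bash
#!/usr/bin/env bash
# find_failed_leaves STACK_NAME_OR_ID REGION
# Hypothetical helper: walks failed nested stacks depth-first and prints
# each failed leaf resource as: LogicalResourceId, ResourceType, Reason.
find_failed_leaves() {
  local stack=$1 region=$2 nested

  # Failed resources in this stack that are not themselves nested stacks.
  aws cloudformation describe-stack-events \
    --stack-name "$stack" --region "$region" \
    --query "StackEvents[?ResourceStatus=='CREATE_FAILED' && ResourceType!='AWS::CloudFormation::Stack'].[LogicalResourceId,ResourceType,ResourceStatusReason]" \
    --output text

  # Failed nested stacks. The PhysicalResourceId!=StackId test skips the
  # stack's own status event, which would otherwise recurse forever.
  for nested in $(aws cloudformation describe-stack-events \
      --stack-name "$stack" --region "$region" \
      --query "StackEvents[?ResourceStatus=='CREATE_FAILED' && ResourceType=='AWS::CloudFormation::Stack' && PhysicalResourceId!=StackId].PhysicalResourceId" \
      --output text | tr '\t' '\n' | sort -u); do
    if [[ -n "$nested" && "$nested" != "None" ]]; then
      find_failed_leaves "$nested" "$region"
    fi
  done
}

# Usage (placeholders as above):
#   find_failed_leaves <TOP_LEVEL_STACK_NAME> <REGION>
```

Entries whose reason is `Resource creation cancelled` are usually side effects; the earliest failure with a specific reason is the more likely root cause.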
49+
50+
---
51+
52+
## Resource Error Catalog
53+
54+
### AWS::SageMaker::Cluster
55+
56+
| Status Reason | Root Cause | Fix |
57+
| ----------------------------------------------------------- | -------------------------------------------------- | ---------------------------------------------------------------------------- |
58+
| `Insufficient capacity in the Availability Zone` | No on-demand instances available in AZ | Add subnet in different AZ; use Flexible Training Plans or reserved capacity |
59+
| `No subnets in the capacity AZ` | Cluster subnet not in the AZ where capacity exists | Create subnet in the AZ where instances are available |
60+
| `EFA health checks did not run successfully` | Security group missing self-referencing rules | Add inbound + outbound self-ref rules on SG (protocol: All, source: self) |
61+
| `Lifecycle scripts did not run successfully` | Script error, S3 access, or timeout | Check CloudWatch logs: `/aws/sagemaker/Clusters/<name>/<id>` |
62+
| `Instance bootstrap failed due to network misconfiguration` | VPC routing or SG issue | Verify NAT Gateway route, S3 VPC endpoint, SG rules |
63+
| `The security group 'sg-xxx' does not exist` | SG ID is wrong or in different region | Verify SG exists in the same region and VPC |
64+
| `The subnet 'subnet-xxx' does not exist` | Subnet ID is wrong or in different region | Verify subnet exists in the same region |
65+
| `You are not authorized to perform this operation` | Execution role missing permissions | Add `AmazonSageMakerClusterInstanceRolePolicy` + VPC permissions |
66+
| `The maximum number of instances ... has been reached` | Service quota exceeded | Request quota increase via Service Quotas console |
67+
68+
### AWS::IAM::Role
69+
70+
| Status Reason | Root Cause | Fix |
71+
| ----------------------------------------- | ------------------------------------- | ---------------------------------------------------------- |
72+
| `Cannot exceed quota for PoliciesPerRole` | Too many managed policies attached | Consolidate managed policies; default limit is 10 managed policies per role |
73+
| `Invalid principal in policy` | Trust policy references wrong service | Use `"Service": "sagemaker.amazonaws.com"` in trust policy |
74+
| `MalformedPolicyDocument` | JSON syntax error in inline policy | Validate JSON; check for trailing commas, missing quotes |
75+
| `EntityAlreadyExists` | Role name already taken | Use unique name or import existing role |
76+
77+
### AWS::EC2::VPC / Subnet / SecurityGroup
78+
79+
| Status Reason | Root Cause | Fix |
80+
| ---------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------------ |
81+
| `The CIDR 'x.x.x.x/y' conflicts with another subnet` | Overlapping CIDR in same VPC | Use non-overlapping CIDR blocks |
82+
| `The maximum number of VPCs has been reached` | VPC quota per region (default: 5) | Request VPC quota increase |
83+
| `InvalidPermission.Duplicate` | SG rule already exists | Skip; not a real error (idempotency issue in template) |
84+
| `RulesPerSecurityGroupLimitExceeded` | More than 60 inbound or 60 outbound rules | Consolidate rules; use CIDR ranges instead of individual IPs |
85+
86+
### AWS::FSx::FileSystem
87+
88+
| Status Reason | Root Cause | Fix |
89+
| ----------------------------------------------- | --------------------------------------- | ---------------------------------------------------- |
90+
| `The subnet is not in a supported AZ` | FSx Lustre not available in subnet's AZ | Use a subnet in an AZ that supports FSx Lustre |
91+
| `The security group does not belong to the VPC` | SG and subnet in different VPCs | Move SG or subnet to same VPC |
92+
| `Insufficient storage capacity` | FSx Lustre capacity exhausted in AZ | Try different AZ or reduce storage size |
93+
| `Invalid deployment type for storage type` | Template uses incompatible FSx config | PERSISTENT_2 requires SSD; check template parameters |
94+
95+
### AWS::Lambda::Function (Custom Resources)
96+
97+
| Status Reason | Root Cause | Fix |
98+
| ------------------------------------------------ | ------------------------------------ | --------------------------------------------------------- |
99+
| `<error message from Lambda>` (Custom::Resource) | Lambda-backed custom resource failed | Find the Lambda function name → check its CloudWatch logs |
100+
| `Timed out` | Lambda exceeded 15-minute limit | Custom resource handler is too slow; check what it does |
101+
102+
**To debug Custom::Resource failures:**
103+
104+
```bash
105+
# Find Lambda function name from CFN Resources tab, then:
106+
aws logs tail /aws/lambda/<FUNCTION_NAME> --region <REGION> --since 1h
107+
```
108+
109+
---
110+
111+
## Rolled-Back Stacks
112+
113+
When a stack rolls back, CloudFormation deletes the resources it created. To view rolled-back stacks:
114+
115+
1. CloudFormation Console → **Deleted** filter (top-right dropdown)
116+
2. Or via CLI:
117+
118+
```bash
119+
aws cloudformation list-stacks \
120+
--stack-status-filter ROLLBACK_COMPLETE DELETE_COMPLETE \
121+
--region <REGION> \
122+
--query 'StackSummaries[?contains(StackName,`HyperPod`) || contains(StackName,`hyperpod`)].{Name:StackName,Status:StackStatus,Time:CreationTime}' \
123+
--output table
124+
```
125+
126+
---
127+
128+
## CFN Template Gotchas
129+
130+
### ThreadsPerCore
131+
132+
`ThreadsPerCore` defaults to 1 (hyperthreading disabled) when set via console "Advanced Configuration." This makes p5.48xlarge show 96 vCPU instead of 192. Fix: set `ThreadsPerCore: 2` explicitly.
133+
134+
Any `UpdateCluster` call via CFN **must include ThreadsPerCore** even if not originally set — omitting it resets to default.
135+
136+
### S3 Bucket Naming
137+
138+
The `SourceS3Uri` must match pattern `s3://sagemaker-*` per API validation. CFN templates typically create a bucket named `sagemaker-lifecycle-<guid>`.
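
A pre-flight check for this naming rule might look like the sketch below. The function name `is_sagemaker_s3_uri` is hypothetical, and the check mirrors only the `s3://sagemaker-*` pattern stated above, not the full server-side validation.

```bash
#!/usr/bin/env bash
# is_sagemaker_s3_uri URI
# Hypothetical helper: succeeds if the URI starts with "s3://sagemaker-".
is_sagemaker_s3_uri() {
  [[ $1 == s3://sagemaker-* ]]
}

is_sagemaker_s3_uri "s3://sagemaker-lifecycle-1234/" && echo "accepted"
is_sagemaker_s3_uri "s3://my-scripts/" || echo "rejected"
```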
139+
140+
### Condition-Dependent Resources
141+
142+
If using the reference HyperPod CFN template, some resources are conditional:
143+
144+
- FSx OpenZFS: only created if `CreateOpenZFS=true`
145+
- S3 VPC Endpoint: only created if `CreateS3Endpoint=true`
146+
- SSM Session Document: only if `CreateSSMSessionDocument=true`
147+
148+
A condition evaluating to `false` means the resource is skipped (not failed).
