| 1 | +# Capacity Planning for HyperPod Clusters |
| 2 | + |
| 3 | +Deep-dive companion to the main [SKILL.md](../SKILL.md) § B (Capacity & AZ) and the `--validate` pre-create mode. Capacity errors are among the most common cluster-creation failures. This reference covers how to choose a capacity strategy, verify availability, and resolve capacity-related failures. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Capacity Acquisition Options |
| 8 | + |
| 9 | +### 1. On-Demand Instances |
| 10 | + |
| 11 | +**Best for:** Small instance types, short-term experiments, development clusters. |
| 12 | + |
| 13 | +- No upfront commitment |
| 14 | +- Available immediately for common types (g5, p3) |
| 15 | +- **Not guaranteed** for large GPU types (p4d, p5, p5e, trn1, trn2) |
| 16 | +- Instances may not be allocated in physical proximity → suboptimal network topology for distributed training |
| 17 | +- Higher hourly cost |
| 18 | + |
| 19 | +```bash |
| 20 | +# Check where an instance type is offered (EC2 type names omit the SageMaker "ml." prefix): |
| 21 | +aws ec2 describe-instance-type-offerings \ |
| 22 | + --location-type availability-zone \ |
| 23 | + --filters "Name=instance-type,Values=p5.48xlarge" \ |
| 24 | + --region us-west-2 \ |
| 25 | + --query 'InstanceTypeOfferings[*].Location' --output table |
| 26 | +``` |
| 27 | + |
| 28 | +### 2. Flexible Training Plans |
| 29 | + |
| 30 | +**Best for:** Medium to large workloads with predictable schedules. |
| 31 | + |
| 32 | +Search for available capacity by instance type, count, and desired schedule (for example with `aws sagemaker search-training-plan-offerings`). AWS returns the available options with pricing. |
| 33 | + |
| 34 | +```bash |
| 35 | +# List active training plans: |
| 36 | +aws sagemaker list-training-plans \ |
| 37 | + --filters Name=Status,Value=Active \ |
| 38 | + --region <REGION> \ |
| 39 | + --query 'TrainingPlanSummaries[*].{Name:TrainingPlanName,Type:InstanceType,Count:TotalInstanceCount,AZ:AvailabilityZone,Status:Status,Start:StartTime,End:EndTime}' \ |
| 40 | + --output table |
| 41 | +``` |
| 42 | + |
| 43 | +**Using with HyperPod:** |
| 44 | + |
| 45 | +```bash |
| 46 | +aws sagemaker create-cluster \ |
| 47 | + --cluster-name my-cluster \ |
| 48 | + --instance-groups '[{ |
| 49 | + "InstanceGroupName": "gpu-workers", |
| 50 | + "InstanceType": "ml.p5.48xlarge", |
| 51 | + "InstanceCount": 4, |
| 52 | + "ExecutionRole": "arn:aws:iam::<ACCT>:role/HyperPodRole", |
| 53 | + "TrainingPlanArn": "arn:aws:sagemaker:<REGION>:<ACCT>:training-plan/<PLAN_NAME>", |
| 54 | + "LifeCycleConfig": { |
| 55 | + "SourceS3Uri": "s3://sagemaker-lifecycle-<guid>/", |
| 56 | + "OnCreate": "on_create.sh" |
| 57 | + } |
| 58 | + }]' \ |
| 59 | + --vpc-config '{"SecurityGroupIds":["sg-xxx"],"Subnets":["subnet-xxx"]}' \ |
| 60 | + --region <REGION> |
| 61 | +``` |
| 62 | + |
| 63 | +**Critical:** The subnet must be in the **same AZ** as the training plan's `AvailabilityZone`. |
| 64 | + |
| 65 | +**Training Plan Status Values:** `Pending`, `Active`, `Scheduled`, `Expired`, `Failed` |
| 66 | + |
| 67 | +**Advantages:** |
| 68 | + |
| 69 | +- Guaranteed capacity for reserved period |
| 70 | +- Discounted pricing vs on-demand |
| 71 | +- Better network topology (co-located instances) |
| 72 | + |
| 73 | +**Disadvantages:** |
| 74 | + |
| 75 | +- Requires advance planning and commitment |
| 76 | +- Capacity locked to specific AZ |
| 77 | + |
| 78 | +### 3. Reserved Capacity (ODCR via AWS Account Team) |
| 79 | + |
| 80 | +**Best for:** Large-scale, long-term capacity needs (months+). |
| 81 | + |
| 82 | +- Contact your AWS account team or TAM |
| 83 | +- Best pricing for sustained usage |
| 84 | +- Guaranteed placement in specific AZ |
| 85 | +- Requires longer lead time |
| 86 | + |
| 87 | +**Verification:** |
| 88 | + |
| 89 | +```bash |
| 90 | +# Check reserved capacity details: |
| 91 | +aws sagemaker list-training-plans \ |
| 92 | + --region <REGION> \ |
| 93 | + --query 'TrainingPlanSummaries[?ReservedCapacitySummaries]' |
| 94 | +``` |
| 95 | + |
| 96 | +**ReservedCapacitySummary fields:** |
| 97 | + |
| 98 | +- `ReservedCapacityArn`, `ReservedCapacityType` (UltraServer or Instance) |
| 99 | +- `InstanceType`, `TotalInstanceCount`, `AvailabilityZone` |
| 100 | +- `DurationHours`, `DurationMinutes`, `StartTime`, `EndTime`, `Status` |
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## AZ Selection Strategy |
| 105 | + |
| 106 | +### The Problem |
| 107 | + |
| 108 | +Instance type availability varies by AZ. A subnet in `us-west-2a` may have capacity while one in `us-west-2c` does not. Worse, the same AZ name (e.g., `us-west-2a`) can map to a different physical zone in each AWS account. |
| 109 | + |
| 110 | +### Use AZ IDs for Consistency |
| 111 | + |
| 112 | +AZ IDs (e.g., `usw2-az1`) are consistent across accounts: |
| 113 | + |
| 114 | +```bash |
| 115 | +# Map AZ names to IDs: |
| 116 | +aws ec2 describe-availability-zones --region <REGION> \ |
| 117 | + --query 'AvailabilityZones[*].{Name:ZoneName,ID:ZoneId,State:State}' --output table |
| 118 | +``` |
| 119 | + |
| 120 | +When coordinating with AWS Support or account teams about reserved capacity, always use **AZ IDs** (not names). |
| 121 | + |
| 122 | +### Verify Subnet Matches Capacity AZ |
| 123 | + |
| 124 | +```bash |
| 125 | +# Your subnet's AZ: |
| 126 | +aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \ |
| 127 | + --query 'Subnets[0].{AZ:AvailabilityZone,AZ_ID:AvailabilityZoneId}' |
| 128 | + |
| 129 | +# AZ IDs where the instance type is offered (use the EC2 name without the "ml." prefix, e.g. p5.48xlarge): |
| 130 | +aws ec2 describe-instance-type-offerings \ |
| 131 | + --location-type availability-zone-id \ |
| 132 | + --filters "Name=instance-type,Values=<TYPE>" \ |
| 133 | + --region <REGION> \ |
| 134 | + --query 'InstanceTypeOfferings[*].Location' |
| 135 | +``` |
| 136 | + |
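| 137 | +If your subnet's AZ ID doesn't appear in the instance type offerings list, create a new subnet in an AZ that does. An AZ in this list only means the type is offered there, not that capacity is currently available. |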
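
The comparison described above can be scripted. The following is a minimal sketch, assuming the subnet's AZ ID and the offered AZ IDs have already been captured as text (for example by adding `--output text` to the two commands above). The function name `az_offered` and the sample values are hypothetical.

```bash
# az_offered <subnet_az_id> <offered_az_id>...
# Hypothetical helper: succeeds if the subnet's AZ ID is among the offered AZ IDs.
az_offered() {
  _target=$1
  shift
  for _az in "$@"; do
    [ "$_az" = "$_target" ] && return 0
  done
  return 1
}

# Placeholder values; in practice, capture them from the describe-subnets and
# describe-instance-type-offerings commands shown above.
subnet_az_id="usw2-az1"
offered="usw2-az1 usw2-az2"

# $offered is intentionally unquoted so each AZ ID becomes its own argument.
if az_offered "$subnet_az_id" $offered; then
  echo "ok: instance type is offered in ${subnet_az_id}"
else
  echo "mismatch: create a subnet in one of: ${offered}"
fi
```

A passing check only confirms the subnet sits in an AZ where the type is offered; it does not confirm that capacity is free at that moment.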
| 138 | + |
| 139 | +--- |
| 140 | + |
| 141 | +## Subnet IP Capacity |
| 142 | + |
| 143 | +GPU instances consume many network interfaces (and IPs) per instance: |
| 144 | + |
| 145 | +| Instance Type | ENIs | IPs per ENI | Total IPs (Slurm) | Total IPs (EKS) | |
| 146 | +| ---------------- | ---- | ------------------------ | ----------------- | --------------- | |
| 147 | +| ml.p5.48xlarge | 32 | 1 primary + 49 secondary | ~32 | ~81 | |
| 148 | +| ml.p5e.48xlarge | 32 | same | ~32 | ~81 | |
| 149 | +| ml.p4d.24xlarge | 4 | 1 primary + 49 secondary | ~4 | ~51 | |
| 150 | +| ml.p4de.24xlarge | 4 | same | ~4 | ~51 | |
| 151 | +| ml.trn1.32xlarge | 8 | 1 primary + 49 secondary | ~8 | ~57 | |
| 152 | +| ml.trn2.48xlarge | 16 | same | ~16 | ~65 | |
| 153 | +| ml.g5.48xlarge | 2 | 1 primary + 14 secondary | ~2 | ~15 | |
| 154 | + |
| 155 | +### Calculate Required IPs |
| 156 | + |
| 157 | +``` |
| 158 | +Required IPs = Instance Count × IPs per Instance |
| 159 | +``` |
| 160 | + |
| 161 | +For example: 16 × ml.p5.48xlarge on EKS = 16 × 81 = 1,296 IPs → requires at least a /21 subnet (2,048 IPs). |
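
The arithmetic above can be sketched as a small script. This is an illustration only: the helper names `required_ips` and `min_prefix` are hypothetical, and the sizing assumes the 5 addresses AWS reserves in every subnet plus the /28 to /16 subnet size limits of a VPC.

```bash
# required_ips <instance_count> <ips_per_instance>
# Hypothetical helper implementing: Required IPs = Instance Count x IPs per Instance
required_ips() {
  echo $(( $1 * $2 ))
}

# min_prefix <required_ips>
# Prints the largest prefix length (smallest subnet) whose usable address
# count covers the requirement. Usable = 2^(32 - prefix) - 5, because AWS
# reserves 5 addresses per subnet. Returns 1 if even a /16 is too small.
min_prefix() {
  _need=$1
  _prefix=28
  while [ "$_prefix" -ge 16 ]; do
    if [ $(( (1 << (32 - _prefix)) - 5 )) -ge "$_need" ]; then
      echo "$_prefix"
      return 0
    fi
    _prefix=$(( _prefix - 1 ))
  done
  return 1
}

# 16 x ml.p5.48xlarge on EKS, using the ~81 IPs per instance from the table above
ips=$(required_ips 16 81)
echo "required=${ips} min_prefix=/$(min_prefix "$ips")"
# prints: required=1296 min_prefix=/21
```

The result is a floor, not a recommendation: other hosts in the same subnet and future growth both argue for choosing a larger block.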
| 162 | + |
| 163 | +### Recommended Subnet Sizes |
| 164 | + |
| 165 | +| Cluster Size (p5) | Orchestrator | Required IPs | Min Subnet CIDR | |
| 166 | +| ----------------- | ------------ | ------------ | --------------- | |
| 167 | +| 4 instances | Slurm | ~128 | /24 (256 IPs) | |
| 168 | +| 4 instances | EKS | ~324 | /23 (512 IPs) | |
| 169 | +| 16 instances | Slurm | ~512 | /22 (1,024 IPs) | |
| 170 | +| 16 instances | EKS | ~1,296 | /21 (2,048 IPs) | |
| 171 | +| 64 instances | Slurm | ~2,048 | /20 (4,096 IPs) | |
| 172 | +| 64 instances | EKS | ~5,184 | /19 (8,192 IPs) | |
| 173 | + |
| 174 | +Minimum sizes account for the 5 addresses AWS reserves in every subnet. **Subnet CIDRs cannot be changed after creation.** Plan for growth. |
| 175 | + |
| 176 | +```bash |
| 177 | +# Check the subnet's CIDR and remaining free IPs (total size follows from the CIDR prefix): |
| 178 | +aws ec2 describe-subnets --subnet-ids <SUBNET> --region <REGION> \ |
| 179 | + --query 'Subnets[0].{CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}' |
| 180 | +``` |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## Service Quotas |
| 185 | + |
| 186 | +Check these **before** creating a cluster: |
| 187 | + |
| 188 | +```bash |
| 189 | +# List SageMaker quotas (search for "cluster"): |
| 190 | +aws service-quotas list-service-quotas \ |
| 191 | + --service-code sagemaker --region <REGION> \ |
| 192 | + --query 'Quotas[?contains(QuotaName,`cluster`) || contains(QuotaName,`Cluster`)].{Name:QuotaName,Value:Value,Code:QuotaCode}' \ |
| 193 | + --output table |
| 194 | +``` |
| 195 | + |
| 196 | +| Quota | Default | What Happens If Exceeded | |
| 197 | +| -------------------------------- | ---------------- | --------------------------------------- | |
| 198 | +| `ml.<type> for cluster usage` | Varies | `CreateCluster` fails with quota error | |
| 199 | +| Max instances per cluster | Account-specific | Cannot add more instance groups | |
| 200 | +| Total instances across clusters | Account-specific | Must delete existing clusters first | |
| 201 | +| Max EBS volume size per instance | 16,384 GB | `CreateCluster` fails if config exceeds | |
| 202 | +| VPCs per region | 5 | CFN VPC creation fails | |
| 203 | +| Network interfaces per region | 5,000 | Instance provisioning fails silently | |
| 204 | +| Elastic IPs per region | 5 | NAT Gateway creation fails | |
| 205 | + |
| 206 | +**Request quota increases proactively.** Increases can take 1-3 business days. |
| 207 | + |
| 208 | +--- |
| 209 | + |
| 210 | +## Troubleshooting Capacity Failures |
| 211 | + |
| 212 | +### "Insufficient capacity" Error |
| 213 | + |
| 214 | +1. Check which AZs have the instance type available (see commands above) |
| 215 | +2. Verify your subnet is in one of those AZs |
| 216 | +3. If no AZ has capacity: try a different region or instance type, or contact your account team |
| 217 | +4. If using Training Plan: verify `TrainingPlanArn` and subnet AZ match |
| 218 | + |
| 219 | +### "No subnets in the capacity AZ" Error |
| 220 | + |
| 221 | +The cluster configuration specifies subnets, but none of them are in the AZ where AWS has capacity. |
| 222 | + |
| 223 | +Fix: Create a new subnet in the AZ where capacity exists and add it to the cluster configuration. |
| 224 | + |
| 225 | +### Cluster Stuck in "Creating" (No Progress) |
| 226 | + |
| 227 | +1. Check `list-cluster-events` for error messages |
| 228 | +2. If no events: likely waiting for capacity |
| 229 | +3. If events show failures: fix the indicated issue |
| 230 | +4. If stuck >1 hour with no events: contact AWS Support |
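
The decision steps above can be written down as a small function. This is a sketch, not an AWS tool: `triage_creating` and its output labels are hypothetical, the event and failure counts are assumed to have been read from `list-cluster-events` beforehand, and the `keep-monitoring` branch (events present, none failing) is an assumption the list does not spell out.

```bash
# triage_creating <event_count> <failure_event_count> <minutes_in_creating>
# Hypothetical helper mirroring the numbered steps above.
triage_creating() {
  _events=$1
  _failures=$2
  _minutes=$3
  if [ "$_failures" -gt 0 ]; then
    echo "fix-reported-failure"          # step 3: events show failures
  elif [ "$_events" -eq 0 ] && [ "$_minutes" -gt 60 ]; then
    echo "contact-aws-support"           # step 4: over 1 hour with no events
  elif [ "$_events" -eq 0 ]; then
    echo "likely-waiting-for-capacity"   # step 2: no events yet
  else
    echo "keep-monitoring"               # assumption: events exist, none failing
  fi
}

# Example: 90 minutes in "Creating" with no events at all
triage_creating 0 0 90
```

Failures are checked first so that a reported error is never masked by the elapsed-time rule.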
| 231 | + |
| 232 | +### Partial Provisioning (Some Nodes Running, Others Failing) |
| 233 | + |
| 234 | +This typically means capacity was available for some instances but not all. |
| 235 | + |
| 236 | +- The cluster will keep retrying if `NodeProvisioningMode=Continuous` |
| 237 | +- Check events for the specific instance group that's failing |
| 238 | +- Consider reducing `InstanceCount` or using `MinInstanceCount` for elastic scaling |