diff --git a/plugins/sagemaker-ai/.claude-plugin/plugin.json b/plugins/sagemaker-ai/.claude-plugin/plugin.json index b0f4dd4d..64c015fc 100644 --- a/plugins/sagemaker-ai/.claude-plugin/plugin.json +++ b/plugins/sagemaker-ai/.claude-plugin/plugin.json @@ -18,5 +18,5 @@ "license": "Apache-2.0", "name": "sagemaker-ai", "repository": "https://github.com/awslabs/agent-plugins", - "version": "1.1.0" + "version": "1.1.1" } diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md new file mode 100644 index 00000000..0973717d --- /dev/null +++ b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md @@ -0,0 +1,267 @@ +--- +name: hyperpod-cluster-debugger +description: Use for cluster-wide SageMaker HyperPod issues (EKS or Slurm) across the full cluster lifecycle — pre-create validation and creation/deployment failures (CloudFormation CREATE_FAILED / ROLLBACK_COMPLETE / "Embedded stack failed", stuck in Creating/Updating/Failed, "EFA health checks did not run successfully", "Lifecycle scripts did not run" / timed out, "Insufficient capacity" / "No subnets in the capacity AZ", "Instance bootstrap failed...network misconfiguration", service-linked role missing, S3 lifecycle / CRLF / on_create.sh); plus post-deployment ops (EKS access entries / kubectl auth, EKS add-ons, AMI / UpdateClusterSoftware rollback, ClusterMaintenanceRollbackFailed, dangling nodes, autoscaler/Karpenter conflicts, service quotas, permission-boundary denials, Slurm controller). Read-only. `--validate` pre-flight checks SGs / subnets / IAM / VPC endpoints / S3 lifecycle / per-AZ capacity. Not for per-node issues (hyperpod-node-debugger), NCCL (hyperpod-nccl), or MFU (hyperpod-mfu-debugger). 
metadata:
  version: "1.0.0"
---

# HyperPod Cluster Debugger

Read-only diagnostic for cluster-level HyperPod issues across the full cluster lifecycle: **pre-create validation**, **deployment/creation failures**, and **post-deployment operations** (cluster-wide health, AMI upgrades, dangling nodes, autoscaler conflicts, Slurm controller, node replacement). Supports **EKS** and **Slurm**.

**Clear separation of concerns:**

- `scripts/diagnose-cluster.sh` is a **read-only signal collector**. It reads cluster state via AWS APIs and (for Slurm controller health) SSM, then prints each detected issue as a `[FAIL]` line. Every `[FAIL]` line ends with a pointer of the form `→ references/cluster-diagnostics-detail.md § <section>
` or `→ references/cluster-operations.md § `. The script never prints remediation commands and never modifies cluster state. +- [references/cluster-diagnostics-detail.md](references/cluster-diagnostics-detail.md) contains the full remediation runbook per section (A–L). +- [references/cluster-operations.md](references/cluster-operations.md) contains operational deep-dives (EFA SG, capacity, lifecycle, EKS access, SSM, node replacement, Slurm operations). +- [references/cloudformation-errors.md](references/cloudformation-errors.md) is the full CloudFormation resource-by-resource error catalog (nested-stack navigation, `AWS::SageMaker::Cluster`/`AWS::IAM::Role`/`AWS::FSx::FileSystem`/`Custom::Resource`/etc.) — open this when § H points you into deep CFN debugging. +- [references/capacity-planning.md](references/capacity-planning.md) is the in-depth capacity strategy guide (on-demand vs. Flexible Training Plans vs. ODCR, AZ/AZ-ID selection, subnet IP sizing per instance type, service quotas) — open this when § B or pre-create validation flags capacity/subnet sizing. +- [references/lifecycle-scripts.md](references/lifecycle-scripts.md) is the in-depth lifecycle-script reference (S3 layout for Slurm/EKS, execution order, `config.py` toggles, on-node debug under `/var/log/provision/`) — open this when § C points you at a specific lifecycle failure. +- This SKILL.md is the **playbook for Claude**: run the script, read each finding's pointer, open the referenced section, walk the customer through the fix with explicit approval. + +**Always run Step 1 first** — it collects all cluster signals and produces a prioritized issue list with reference pointers. + +--- + +## Workflow (authoritative) + +1. **Collect inputs** — HyperPod cluster name (not EKS name), region, exact error message from console/CLI/CloudFormation. +2. **Run `scripts/diagnose-cluster.sh`** (or `--validate` for pre-create checks). +3. 
**Read the script output top-to-bottom.** For every `[FAIL]` line, note the trailing `→ references/<file>.md § <section>
` pointer.
4. **Open each referenced section.** Use the `Read` tool on the exact file path.
5. **Present the remediation to the customer** with the finding, root cause, exact command(s), and blast radius. Cluster-level remediations (SG changes, AMI upgrade, kubeconfig overwrite, node replacement, service-linked role creation) have wider blast radius than node-level — describe the impact clearly.
6. **Wait for explicit customer approval** before running any state-changing command.
7. **Re-run the diagnostic** after remediation to confirm.

---

## Step 1: Collect information & run diagnostics

Ask the customer for:

- **HyperPod cluster name** (not the EKS cluster name):

  ```bash
  aws sagemaker list-clusters --region <region> --query 'ClusterSummaries[*].ClusterName'
  ```

- **AWS region** — e.g. `us-east-1`, `us-west-2`
- **Error message** — the exact error from console, CLI, or CloudFormation

Then run the diagnostic script:

```bash
# Diagnose an existing cluster (read-only; prints findings with references/... pointers):
bash scripts/diagnose-cluster.sh --cluster <cluster-name> --region <region>

# Pre-flight validation (no cluster needed — validates SGs, subnets, IAM, VPC endpoints,
# optionally S3 lifecycle scripts and per-AZ instance-type capacity):
bash scripts/diagnose-cluster.sh --validate --region <region> \
  --sg-ids <sg-id,...> --subnet-ids <subnet-id,...> [--iam-role <role-name>] \
  [--s3-uri s3://<bucket>/path/] [--instance-type ml.p5.48xlarge]
```

The script collects in one pass: cluster status, orchestrator type, provisioning mode, instance-group health, cluster events, VPC/SG configuration, EKS access + add-ons + aws-auth, SSM readiness, CloudWatch log availability, Slurm controller health (when applicable), dangling/orphaned node reconciliation.
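Once a run completes, the `[FAIL]` findings and the reference files they point at can be triaged mechanically. A minimal sketch over simulated output (real lines come from `diagnose-cluster.sh`; the finding text below is invented for illustration):

```shell
# Simulated diagnostic output; real runs emit lines of this shape
DIAG='[PASS] Cluster status: InService
[FAIL] SG sg-0abc missing self-referencing EFA rules → references/cluster-diagnostics-detail.md
[FAIL] Subnet subnet-0abc low on free IPs → references/capacity-planning.md'

# Keep only failures, then extract the unique reference files to open with Read
POINTERS="$(printf '%s\n' "$DIAG" | grep '\[FAIL\]' | grep -o 'references/[^ ]*' | sort -u)"
printf '%s\n' "$POINTERS"
```

This gives a deduplicated reading list before walking the customer through each section.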
Issues are categorized:

- **P0** — Fix immediately (blocks cluster operation)
- **P1** — Fix soon (degraded or at-risk)
- **P2** — Informational (review when convenient)

### Output tags

| Tag | Meaning |
| -------- | -------------------------------------------------------------------------------- |
| `[PASS]` | Check passed |
| `[FAIL]` | Problem found — counted in `CRITICAL_FAILURES` with a `→ references/...` pointer |
| `[WARN]` | Advisory |
| `[INFO]` | Informational |

The script never prints remediation commands. Each `[FAIL]` entry ends with a pointer of the form `→ references/cluster-diagnostics-detail.md § <section>
` (or `→ references/cluster-operations.md § `). Open the referenced section with `Read` to find the remediation runbook. + +--- + +## Step 2: Match signal → section + +**From `list-cluster-events` / error messages:** + +| Event / Error Message | Section | +| ----------------------------------------------------------------------------------- | -------------------------------------------------------------- | +| `"EFA health checks did not run successfully"` | **[A: EFA Health Checks](#a-efa-health-checks)** | +| `"Insufficient capacity"` / `"No subnets in the capacity AZ"` | **[B: Capacity & AZ](#b-capacity--az)** | +| `"Lifecycle scripts did not run successfully"` / `"timed out"` | **[C: Lifecycle Scripts](#c-lifecycle-scripts)** | +| `"the server has asked for the client to provide credentials"` / kubectl auth error | **[D: EKS Access](#d-eks-access--kubectl)** | +| Cluster InService but no instances visible / nodes not showing | **[E: Cluster Provisioning](#e-cluster-provisioning)** | +| `"Target is not connected"` / SSM errors | **[F: SSM Connectivity](#f-ssm-connectivity)** | +| Node replacement not happening / `batch-replace` not working | **[G: Node Replacement](#g-node-replacement)** | +| `"Embedded stack failed"` / CloudFormation error | **[H: CloudFormation Errors](#h-cloudformation-errors)** | +| `"ENI limit exceeded"` / `"vCPU limit"` / service quota error | **[B: Capacity & AZ](#b-capacity--az)** | +| `"UpdateClusterSoftware"` failed / AMI upgrade error | **[J: AMI & Cluster Updates](#j-ami--cluster-updates)** | +| Cluster stuck in `ClusterMaintenanceRollbackFailed` | **[J: AMI & Cluster Updates](#j-ami--cluster-updates)** | +| Dangling nodes on EKS after scale-up rollback | **[K: Dangling Nodes & Cleanup](#k-dangling-nodes--cleanup)** | +| Cluster Autoscaler stops working after HyperPod attached | **[L: Autoscaler Compatibility](#l-autoscaler-compatibility)** | + +**From symptoms:** + +| Symptom | Section | +| 
----------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | +| Cluster creation failed | Run script → follow section pointer | +| Cluster stuck in Creating/Updating/Deleting > 1 hour | **[E: Cluster Provisioning](#e-cluster-provisioning)** | +| Cluster stuck in RollbackFailed / MaintenanceFailed | **[J: AMI & Cluster Updates](#j-ami--cluster-updates)** | +| AMI upgrade silently fails and rolls back | **[J: AMI & Cluster Updates](#j-ami--cluster-updates)** | +| Cluster InService, `kubectl get nodes` returns nothing | **[D](#d-eks-access--kubectl)** then **[E](#e-cluster-provisioning)** | +| Auto-repair enabled but nodes not being replaced | **[G: Node Replacement](#g-node-replacement)** | +| Ran `batch-replace-cluster-nodes` but nothing happened | **[G: Node Replacement](#g-node-replacement)** | +| Can't SSM into nodes | **[F: SSM Connectivity](#f-ssm-connectivity)** | +| Ghost/dangling nodes visible in EKS after rollback | **[K: Dangling Nodes & Cleanup](#k-dangling-nodes--cleanup)** | +| Cluster Autoscaler broken after HyperPod attachment | **[L: Autoscaler Compatibility](#l-autoscaler-compatibility)** | +| Node stuck in Failed after reboot | **[G: Node Replacement](#g-node-replacement)** | +| Topology labels missing on new nodes | **[K: Dangling Nodes & Cleanup](#k-dangling-nodes--cleanup)** | +| Need to find instance ID from Slurm node name | **[I: Utilities](#i-utilities)** | +| Slow I/O, data-loading bottleneck, FSx throughput saturated | [references/cluster-operations.md § 10 Filesystem Performance](references/cluster-operations.md) | + +--- + +## A: EFA Health Checks + +Security group missing self-referencing rules for inter-node EFA — the #1 cluster creation failure. Diagnose with the cluster script or `describe-security-groups`, then add inbound/outbound self-referencing rules plus outbound internet access to every SG used by the cluster. 
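The remediation in § A boils down to two `authorize-security-group-*` calls. Because they are state-changing, this sketch only prints them for customer review (the SG ID is a placeholder; the `--ip-permissions` shorthand assumes an all-traffic self-reference is acceptable for your security posture):

```shell
# Placeholder security group ID; substitute the SG(s) actually attached to the cluster
SG_ID="sg-0123456789abcdef0"

# Self-referencing all-traffic rules required for inter-node EFA; printed, not run,
# until the customer approves (these widen the SG's allowed traffic)
EFA_FIX="aws ec2 authorize-security-group-ingress --group-id ${SG_ID} --ip-permissions \"IpProtocol=-1,UserIdGroupPairs=[{GroupId=${SG_ID}}]\"
aws ec2 authorize-security-group-egress --group-id ${SG_ID} --ip-permissions \"IpProtocol=-1,UserIdGroupPairs=[{GroupId=${SG_ID}}]\""
printf '%s\n' "$EFA_FIX"
```

Repeat for every SG used by the cluster, then re-run the diagnostic to confirm.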
Full procedure: [references/cluster-diagnostics-detail.md § A](references/cluster-diagnostics-detail.md#a-efa-health-checks).

## B: Capacity & AZ

Instance type unavailable in the requested Availability Zone. Check AZ offerings with `describe-instance-type-offerings`, then try a different AZ, use Flexible Training Plans, or request reserved capacity.
Full procedure: [references/cluster-diagnostics-detail.md § B](references/cluster-diagnostics-detail.md#b-capacity--az).

## C: Lifecycle Scripts

Lifecycle scripts failed or timed out during provisioning. Check CloudWatch logs under `/aws/sagemaker/Clusters/<cluster-name>/<cluster-id>` for the specific error — common causes: missing S3 VPC endpoints, IAM permission gaps, Windows line endings, instance-group name mismatches.
Full procedure: [references/cluster-diagnostics-detail.md § C](references/cluster-diagnostics-detail.md#c-lifecycle-scripts).

## D: EKS Access / kubectl

IAM identity not in EKS access entries or kubeconfig not set up. Verify with `sts get-caller-identity`, check access entries and auth mode on the EKS cluster, then create an access entry with admin policy and update kubeconfig.
Full procedure: [references/cluster-diagnostics-detail.md § D](references/cluster-diagnostics-detail.md#d-eks-access--kubectl).

## E: Cluster Provisioning

Cluster shows InService but instances are missing — often expected with Continuous Provisioning (EKS only), where the cluster goes InService before all nodes are created. Check cluster events and node status; failures surface as events, not cluster-level errors.

**Cluster stuck in Creating/Updating/Deleting > 1 hour:** Check CloudFormation nested stacks for the real error (§ H), verify the IAM execution role has required permissions, check for capacity issues (§ B), and look at cluster events. If stuck in Deleting, check for VPC ENI dependencies. If no progress after 2 hours with no error events, escalate to AWS Support.
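For a stuck cluster, the first two read-only calls are the overall status and the recent events. A sketch with a hypothetical cluster name, printed for review in keeping with the workflow (the `list-cluster-events` flags mirror the usage documented elsewhere in this skill):

```shell
CLUSTER="my-hyperpod"   # hypothetical HyperPod cluster name (not the EKS name)
REGION="us-west-2"

# Both calls are read-only: current status, then the 20 most recent events
CHECKS="aws sagemaker describe-cluster --cluster-name ${CLUSTER} --region ${REGION} --query ClusterStatus
aws sagemaker list-cluster-events --cluster-name ${CLUSTER} --region ${REGION} --max-results 20"
printf '%s\n' "$CHECKS"
```

If events show failures, follow their section pointers; if there are no events at all, suspect capacity (§ B) or nested-stack failures (§ H).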
Full procedure: [references/cluster-diagnostics-detail.md § E](references/cluster-diagnostics-detail.md#e-cluster-provisioning).

## F: SSM Connectivity

SSM session fails with `Target is not connected`. Use the `sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>` target format (not a raw EC2 instance ID), verify the SSM plugin is installed, and confirm the node is Running. Check IAM permissions and VPC endpoints if timeouts persist.
Full procedure: [references/cluster-diagnostics-detail.md § F](references/cluster-diagnostics-detail.md#f-ssm-connectivity).

## G: Node Replacement

Auto or manual node replacement not triggering. For auto-replacement, verify `NodeRecovery` is enabled, check health agent logs and node labels/reasons, and confirm capacity. For manual recovery: reboot first, replace only if reboot fails. Cluster must be InService for `batch-replace-cluster-nodes`.
Full procedure: [references/cluster-diagnostics-detail.md § G](references/cluster-diagnostics-detail.md#g-node-replacement).

## H: CloudFormation Errors

Nested stack failures produce a vague `Embedded stack failed`. Drill into nested stacks via the Events tab filtered by Failed until you reach the actual non-stack resource failure. CLI alternative: ``describe-stack-events --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'``. Includes guidance for service-linked role (SLR) failures and permission boundaries.
Full procedure: [references/cluster-diagnostics-detail.md § H](references/cluster-diagnostics-detail.md#h-cloudformation-errors).

## I: Utilities

Map Slurm node names (`ip-10-1-123-45`) to HyperPod instance IDs using `list-cluster-nodes` or the on-node `resource_config.json`. For large clusters, use the dump utility in `references/cluster-operations.md`.
Full procedure: [references/cluster-diagnostics-detail.md § I](references/cluster-diagnostics-detail.md#i-utilities).
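The § I mapping works because Slurm node names encode the instance's private IPv4. A minimal conversion sketch (matching the resulting IP against `list-cluster-nodes` output or the node's `resource_config.json` is the follow-up step):

```shell
# Slurm node names embed the private IPv4: ip-10-1-123-45 <-> 10.1.123.45
NODE="ip-10-1-123-45"
IP="${NODE#ip-}"    # strip the leading "ip-"
IP="${IP//-/.}"     # turn remaining dashes back into dots
echo "$IP"
```

The recovered IP is then compared against the node inventory to find the HyperPod instance ID.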
+ +## J: AMI & Cluster Updates + +`UpdateClusterSoftware` fails silently and rolls back, or the cluster gets stuck in `ClusterMaintenanceRollbackFailed`. Common causes: lifecycle scripts incompatible with the new AMI, insufficient capacity for the rolling update, or IAM gaps. For `RollbackFailed` (non-terminal state), collect diagnostics and escalate — do NOT attempt to delete and recreate. +Full procedure: [references/cluster-diagnostics-detail.md § J](references/cluster-diagnostics-detail.md#j-ami--cluster-updates). + +## K: Dangling Nodes & Cleanup + +After a failed scale-up or rollback, EKS may show nodes that HyperPod no longer manages — "dangling" nodes appear in `kubectl get nodes` but not in `list-cluster-nodes`. The diagnostic script flags them automatically. Topology-label gaps on new nodes typically resolve on the next reconciliation cycle. +Full procedure: [references/cluster-diagnostics-detail.md § K](references/cluster-diagnostics-detail.md#k-dangling-nodes--cleanup). + +## L: Autoscaler Compatibility + +Cluster Autoscaler can conflict with HyperPod-managed node groups because HyperPod controls node lifecycle independently. The fix is to exclude HyperPod node groups from CAS via the node-level `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` annotation (not `safe-to-evict`, which is a pod annotation), or via the `--balancing-ignore-label=sagemaker.amazonaws.com/compute-type` CAS arg. +Full procedure: [references/cluster-diagnostics-detail.md § L](references/cluster-diagnostics-detail.md#l-autoscaler-compatibility). + +**Karpenter:** HyperPod nodes are not managed by Karpenter NodePools and should not conflict. If you see Karpenter disrupting HyperPod nodes, add `karpenter.sh/do-not-disrupt: "true"` to HyperPod pods, or configure a NodePool `requirements` filter that excludes nodes with `sagemaker.amazonaws.com/compute-type=hyperpod`. 
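Both fixes come down to a single annotation each. A sketch that prints the two commands (node and pod names are placeholders; apply only after customer approval):

```shell
NODE="hyperpod-i-0123456789abcdef0"   # placeholder EKS node name
POD="my-training-pod"                 # placeholder workload pod name

# CAS uses a node-level annotation; Karpenter disruption protection is pod-level
FIXES="kubectl annotate node ${NODE} cluster-autoscaler.kubernetes.io/scale-down-disabled=true --overwrite
kubectl annotate pod ${POD} karpenter.sh/do-not-disrupt=true --overwrite"
printf '%s\n' "$FIXES"
```

For fleets, prefer setting the Karpenter annotation in the pod template rather than annotating pods one by one.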
---

## Read-only guarantee & remediation principle

The scripts in this skill never mutate cluster state and never emit remediation commands. Each issue detected points at a `references/<file>.md § <section>
`; open that section with `Read` to find the root cause, exact commands, verification, and blast radius. Cluster-level remediations (SG changes, AMI upgrade, kubeconfig overwrite, node replacement) have wide blast radius — always explain the impact and wait for explicit customer approval before running anything. + +## Prerequisites + +Required on the machine running the skill: + +- `aws` CLI v2.13+ — authenticated to the AWS account that owns the HyperPod cluster. +- `jq` — used for JSON parsing in `--validate` mode and add-on parsing. +- `python3` — used for safe JSON manipulation and SSM payload building. +- `bash` 4.2+. + +Required for EKS cluster checks: + +- `kubectl` — authenticated to the EKS cluster behind the HyperPod cluster. If absent, EKS-specific checks (access entries, add-ons, aws-auth) are skipped. +- `eks:DescribeCluster`, `eks:ListAccessEntries`, `eks:ListAddons`, `eks:DescribeAddon` in the caller's IAM. + +Required for Slurm controller health (SSM-based): + +- `session-manager-plugin`. The controller's instance role must include `AmazonSSMManagedInstanceCore`. + +See [references/cluster-operations.md § 4 EKS Access Control](references/cluster-operations.md) and [§ 6 SSM Target Format](references/cluster-operations.md) for setup. + +## Defaults + +- **Region**: reads `$AWS_DEFAULT_REGION`; if unset, `us-east-1`. +- **Mode**: diagnose an existing cluster (`--cluster `). Use `--validate` for pre-create checks on SGs / subnets / IAM. +- **Output colors**: ANSI colors on; `--no-color` disables. +- **Event window**: `list-cluster-events --max-results 20` (most recent). For long provisioning incidents, cross-check CloudWatch log streams (§ 7 in the script output). +- **Node list pagination**: paginated via `--no-paginate` / `NextToken` up to 5000 nodes. +- **SSM command timeout**: 180 seconds per controller probe with retries for throttling. +- **Read-only**: the script NEVER modifies cluster state and NEVER prints remediation commands. 
+ +## Error Handling + +| Failure mode | Script behavior | What to tell the customer | +| --------------------------------------------------------- | --------------------------------------------------------------------------------- | ---------------------------------------------------------------------- | +| `aws sts get-caller-identity` fails | Exit 1 | "Fix AWS credentials and rerun." | +| Cluster not found | Exit 1 after listing clusters in the region | "Confirm the HyperPod cluster name (not the EKS name) and region." | +| `sagemaker:*` / `ec2:*` / `eks:*` / `logs:*` AccessDenied | Warn, add `Missing IAM permission for `, continue with partial data | "Grant the listed IAM action and rerun." | +| `kubectl` absent / not authenticated | Skip EKS-specific checks (access entries, add-ons, aws-auth, node reconciliation) | "Install/authenticate kubectl; § D in references." | +| SSM plugin absent (Slurm cluster) | Skip Slurm controller probe | "Install session-manager-plugin; § F in references." | +| SSM `send-command` throttled | Retry with backoff; if still throttled, warn and continue | "Rerun later — script is idempotent." | +| SSM command times out (180s) on large Slurm fleets | Return partial output, note in summary | "Rerun during a quiet window or reduce sinfo scope." | +| CloudWatch log group not found | Skip CloudWatch check, continue | "CloudWatch not configured on this cluster; see § 5 in operations.md." | + +Exit codes: `0` = diagnostic complete (issues may still exist — check output); `1` = cluster not found / fatal prerequisite missing / critical failures present in `--validate` mode. + +## IAM permissions required + +See [references/iam-permissions.md](references/iam-permissions.md) for the full IAM policy. 
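The exit codes translate into a simple gate for automation; a self-contained sketch using a stand-in function instead of the real script:

```shell
# Stand-in for scripts/diagnose-cluster.sh (the real script exits 0 or 1 as documented above)
run_diag() { return 1; }   # simulate a fatal prerequisite failure

if run_diag; then
  NEXT="review the [FAIL] pointers"   # exit 0: run completed, but issues may still exist
else
  NEXT="fix prerequisites / cluster name, then rerun"
fi
echo "$NEXT"
```

Note that exit 0 does not mean the cluster is healthy; the output still has to be read.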
+ +## Skill delegation + +| Need | Use | +| ----------------------------------------------------------------- | ------------------------------------------------------ | +| Per-node runtime issues (GPU, disk, OOM, Slurm) | `hyperpod-node-debugger` skill | +| SSM failure on a single node | `hyperpod-node-debugger` § K | +| Cluster-wide SSM outage (all nodes unreachable) | stay here — § F | +| Single-node EFA health-check failure post-provisioning | `hyperpod-node-debugger` § A | +| Cluster-wide EFA health-check failure at creation time | stay here — § A | +| Cluster creation / deployment failures (CFN, capacity, lifecycle) | stay here — run `--validate` first, then `§ H / B / C` | +| NCCL timeout / distributed training errors | `hyperpod-nccl` skill | +| Shell access to nodes | `hyperpod-ssm` skill | +| Software version comparison across nodes | `hyperpod-version-checker` skill | +| Diagnostic bundle for AWS Support | `hyperpod-issue-report` skill | +| Training performance / MFU degradation | `hyperpod-mfu-debugger` skill | + +## Escalate to AWS Support when + +1. EFA health checks fail despite all SG rules being correct. +2. Capacity errors persist despite a valid Flexible Training Plan / ODCR. +3. Node replacement keeps failing with no clear error in events or logs. +4. Cluster stuck in a non-terminal state (Creating/Updating) for an extended period. +5. CloudFormation root cause error is an internal service error. + +Collect diagnostics with `scripts/diagnose-cluster.sh` and `hyperpod-issue-report` before escalating. 
diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md new file mode 100644 index 00000000..a1a8bf47 --- /dev/null +++ b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md @@ -0,0 +1,238 @@ +# Capacity Planning for HyperPod Clusters + +Deep-dive companion to the main [SKILL.md](../SKILL.md) § B (Capacity & AZ) and the `--validate` pre-create mode. Capacity errors are one of the most common cluster-creation failures. This reference covers how to choose the right capacity strategy, verify availability, and resolve capacity-related failures. + +--- + +## Capacity Acquisition Options + +### 1. On-Demand Instances + +**Best for:** Small instance types, short-term experiments, development clusters. + +- No upfront commitment +- Available immediately for common types (g5, p3) +- **Not guaranteed** for large GPU types (p4d, p5, p5e, trn1, trn2) +- Instances may not be allocated in physical proximity → suboptimal network topology for distributed training +- Higher hourly cost + +```bash +# Check where an instance type is available: +aws ec2 describe-instance-type-offerings \ + --location-type availability-zone \ + --filters "Name=instance-type,Values=ml.p5.48xlarge" \ + --region us-west-2 \ + --query 'InstanceTypeOfferings[*].Location' --output table +``` + +### 2. Flexible Training Plans + +**Best for:** Medium to large workloads with predictable schedules. + +Query available capacity by instance type, count, and desired schedule. AWS returns available options with pricing. 
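To see what schedules AWS can actually offer before committing, the `SearchTrainingPlanOfferings` API can be queried. A printed sketch (parameter names are my best reading of the API and worth confirming with `aws sagemaker search-training-plan-offerings help`):

```shell
# Printed, not executed; values are placeholders for illustration
QUERY="aws sagemaker search-training-plan-offerings \
  --instance-type ml.p5.48xlarge \
  --instance-count 4 \
  --target-resources hyperpod-cluster \
  --region us-west-2"
printf '%s\n' "$QUERY"
```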
+ +```bash +# List active training plans: +aws sagemaker list-training-plans \ + --filters Name=Status,Value=Active \ + --region \ + --query 'TrainingPlanSummaries[*].{Name:TrainingPlanName,Type:InstanceType,Count:TotalInstanceCount,AZ:AvailabilityZone,Status:Status,Start:StartTime,End:EndTime}' \ + --output table +``` + +**Using with HyperPod:** + +```bash +aws sagemaker create-cluster \ + --cluster-name my-cluster \ + --instance-groups '[{ + "InstanceGroupName": "gpu-workers", + "InstanceType": "ml.p5.48xlarge", + "InstanceCount": 4, + "ExecutionRole": "arn:aws:iam:::role/HyperPodRole", + "TrainingPlanArn": "arn:aws:sagemaker:::training-plan/", + "LifeCycleConfig": { + "SourceS3Uri": "s3://sagemaker-lifecycle-/", + "OnCreate": "on_create.sh" + } + }]' \ + --vpc-config '{"SecurityGroupIds":["sg-xxx"],"Subnets":["subnet-xxx"]}' \ + --region +``` + +**Critical:** The subnet must be in the **same AZ** as the training plan's `AvailabilityZone`. + +**Training Plan Status Values:** `Pending`, `Active`, `Scheduled`, `Expired`, `Failed` + +**Advantages:** + +- Guaranteed capacity for reserved period +- Discounted pricing vs on-demand +- Better network topology (co-located instances) + +**Disadvantages:** + +- Requires advance planning and commitment +- Capacity locked to specific AZ + +### 3. Reserved Capacity (ODCR via AWS Account Team) + +**Best for:** Large-scale, long-term capacity needs (months+). 
+ +- Contact your AWS account team or TAM +- Best pricing for sustained usage +- Guaranteed placement in specific AZ +- Requires longer lead time + +**Verification:** + +```bash +# Check reserved capacity details: +aws sagemaker list-training-plans \ + --region \ + --query 'TrainingPlanSummaries[?ReservedCapacitySummaries]' +``` + +**ReservedCapacitySummary fields:** + +- `ReservedCapacityArn`, `ReservedCapacityType` (UltraServer or Instance) +- `InstanceType`, `TotalInstanceCount`, `AvailabilityZone` +- `DurationHours`, `DurationMinutes`, `StartTime`, `EndTime`, `Status` + +--- + +## AZ Selection Strategy + +### The Problem + +Instance type availability varies by AZ. A subnet in `us-west-2a` may have capacity, while `us-west-2c` does not. Worse, AZ names (e.g., `us-west-2a`) map to different physical zones per AWS account. + +### Use AZ IDs for Consistency + +AZ IDs (e.g., `usw2-az1`) are consistent across accounts: + +```bash +# Map AZ names to IDs: +aws ec2 describe-availability-zones --region \ + --query 'AvailabilityZones[*].{Name:ZoneName,ID:ZoneId,State:State}' --output table +``` + +When coordinating with AWS Support or account teams about reserved capacity, always use **AZ IDs** (not names). + +### Verify Subnet Matches Capacity AZ + +```bash +# Your subnet's AZ: +aws ec2 describe-subnets --subnet-ids --region \ + --query 'Subnets[0].{AZ:AvailabilityZone,AZ_ID:AvailabilityZoneId}' + +# Instance type availability per AZ: +aws ec2 describe-instance-type-offerings \ + --location-type availability-zone-id \ + --filters "Name=instance-type,Values=" \ + --region \ + --query 'InstanceTypeOfferings[*].Location' +``` + +If your subnet's AZ doesn't appear in the instance type offerings list, create a new subnet in an AZ that does. 
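Combining the two outputs above is a one-line membership check; a sketch with hypothetical AZ IDs:

```shell
# Values as returned by the two commands above (hypothetical)
SUBNET_AZ_ID="usw2-az2"               # from describe-subnets
OFFERED_AZ_IDS="usw2-az1 usw2-az3"    # from describe-instance-type-offerings

# Whole-word membership test via padded-string matching
case " ${OFFERED_AZ_IDS} " in
  *" ${SUBNET_AZ_ID} "*) RESULT="ok: subnet AZ offers the instance type" ;;
  *)                     RESULT="mismatch: create a subnet in one of: ${OFFERED_AZ_IDS}" ;;
esac
echo "$RESULT"
```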
+ +--- + +## Subnet IP Capacity + +GPU instances consume many network interfaces (and IPs) per instance: + +| Instance Type | ENIs | IPs per ENI | Total IPs (Slurm) | Total IPs (EKS) | +| ---------------- | ---- | ------------------------ | ----------------- | --------------- | +| ml.p5.48xlarge | 32 | 1 primary + 49 secondary | ~32 | ~81 | +| ml.p5e.48xlarge | 32 | same | ~32 | ~81 | +| ml.p4d.24xlarge | 4 | 1 primary + 49 secondary | ~4 | ~51 | +| ml.p4de.24xlarge | 4 | same | ~4 | ~51 | +| ml.trn1.32xlarge | 8 | 1 primary + 49 secondary | ~8 | ~57 | +| ml.trn2.48xlarge | 16 | same | ~16 | ~65 | +| ml.g5.48xlarge | 2 | 1 primary + 14 secondary | ~2 | ~15 | + +### Calculate Required IPs + +``` +Required IPs = Instance Count × IPs per Instance +``` + +For example: 16 × ml.p5.48xlarge on EKS = 16 × 81 = 1,296 IPs → requires at least a /21 subnet (2,048 IPs). + +### Recommended Subnet Sizes + +| Cluster Size (p5) | Orchestrator | Min Subnet CIDR | +| ----------------- | ------------ | ---------------------------- | +| 4 instances | Slurm | /25 (128 IPs) | +| 4 instances | EKS | /24 (256 IPs, plus overhead) | +| 16 instances | Slurm | /23 (512 IPs) | +| 16 instances | EKS | /21 (2,048 IPs) | +| 64 instances | Slurm | /21 (2,048 IPs) | +| 64 instances | EKS | /19 (8,192 IPs) | + +**Subnet CIDRs cannot be changed after creation.** Plan for growth. 
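The sizing rule above can be computed directly. A sketch for 16 × ml.p5.48xlarge on EKS, assuming the 5 AWS-reserved addresses per subnet:

```shell
INSTANCES=16
IPS_PER_INSTANCE=81   # ml.p5.48xlarge on EKS, per the table above
REQUIRED=$((INSTANCES * IPS_PER_INSTANCE))

# Shrink the prefix until the subnet's usable space covers the requirement
PREFIX=28
while [ $(( (1 << (32 - PREFIX)) - 5 )) -lt "$REQUIRED" ] && [ "$PREFIX" -gt 16 ]; do
  PREFIX=$((PREFIX - 1))
done
echo "Need ${REQUIRED} IPs -> at least a /${PREFIX} subnet"
```

This reproduces the worked example above (1,296 IPs, at least a /21), before accounting for other tenants of the subnet.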
+ +```bash +# Check current availability: +aws ec2 describe-subnets --subnet-ids --region \ + --query 'Subnets[0].{CIDR:CidrBlock,TotalIPs:CidrBlock,FreeIPs:AvailableIpAddressCount}' +``` + +--- + +## Service Quotas + +Check these **before** creating a cluster: + +```bash +# List SageMaker quotas (search for "cluster"): +aws service-quotas list-service-quotas \ + --service-code sagemaker --region \ + --query 'Quotas[?contains(QuotaName,`cluster`) || contains(QuotaName,`Cluster`)].{Name:QuotaName,Value:Value,Code:QuotaCode}' \ + --output table +``` + +| Quota | Default | What Happens If Exceeded | +| -------------------------------- | ---------------- | --------------------------------------- | +| `ml. for cluster usage` | Varies | `CreateCluster` fails with quota error | +| Max instances per cluster | Account-specific | Cannot add more instance groups | +| Total instances across clusters | Account-specific | Must delete existing clusters first | +| Max EBS volume size per instance | 16,384 GB | `CreateCluster` fails if config exceeds | +| VPCs per region | 5 | CFN VPC creation fails | +| Network interfaces per region | 5,000 | Instance provisioning fails silently | +| Elastic IPs per region | 5 | NAT Gateway creation fails | + +**Request quota increases proactively** — increases can take 1-3 business days. + +--- + +## Troubleshooting Capacity Failures + +### "Insufficient capacity" Error + +1. Check which AZs have the instance type available (see commands above) +2. Verify your subnet is in one of those AZs +3. If no AZ has capacity: try a different region, instance type, or contact account team +4. If using Training Plan: verify `TrainingPlanArn` and subnet AZ match + +### "No subnets in the capacity AZ" Error + +The cluster configuration specifies subnets, but none of them are in the AZ where AWS has capacity. + +Fix: Create a new subnet in the AZ where capacity exists and add it to the cluster configuration. 
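The fix is a single `create-subnet` call pinned to the AZ ID where capacity exists; printed here with placeholder VPC/CIDR values for review before running:

```shell
# Placeholder VPC ID and CIDR; size the CIDR per the Subnet IP Capacity tables above
NEW_SUBNET="aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
  --availability-zone-id usw2-az2 --cidr-block 10.1.8.0/21 --region us-west-2"
printf '%s\n' "$NEW_SUBNET"
```

Using `--availability-zone-id` (rather than the AZ name) keeps the placement consistent with what the account team reserved.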
+
+### Cluster Stuck in "Creating" (No Progress)
+
+1. Check `list-cluster-events` for error messages
+2. If no events: likely waiting for capacity
+3. If events show failures: fix the indicated issue
+4. If stuck >1 hour with no events: contact AWS Support
+
+### Partial Provisioning (Some Nodes Running, Others Failing)
+
+This typically means capacity was available for some instances but not all.
+
+- The cluster will keep retrying if `NodeProvisioningMode=Continuous`
+- Check events for the specific instance group that's failing
+- Consider reducing `InstanceCount` or using `MinInstanceCount` for elastic scaling
diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md
new file mode 100644
index 00000000..4be45465
--- /dev/null
+++ b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md
@@ -0,0 +1,148 @@
+# CloudFormation Error Reference for HyperPod Deployments
+
+Deep-dive companion to the main [SKILL.md](../SKILL.md) § H (CloudFormation Errors). When deploying HyperPod via the SageMaker console or CloudFormation templates, failures surface as `CREATE_FAILED` or `ROLLBACK_COMPLETE` at the top-level stack. The actual root cause is usually buried several levels deep in nested stacks.
+
+---
+
+## Navigating Nested Stacks
+
+### Stack Hierarchy (Console Deployments)
+
+A typical HyperPod console deployment creates this stack structure:
+
+```
+Top-Level Stack (HyperPod-<stack-name>)
+├── NetworkStack (VPC, subnets, IGW, NAT, SG, S3 endpoint)
+├── StorageStack (FSx Lustre, optional OpenZFS)
+├── IAMStack (execution role, instance profile)
+├── S3Stack (lifecycle scripts bucket + upload)
+└── ClusterStack (AWS::SageMaker::Cluster resource)
+    └── [The cluster resource itself — most failures end here]
+```
+
+### Step-by-Step Navigation
+
+1. **CloudFormation Console** → ensure correct region → find the HyperPod stack
+2. **Status filter:** look for `CREATE_FAILED` or `ROLLBACK_COMPLETE`
+3. **Events tab** → filter by `CREATE_FAILED` → note the earliest failure timestamp
+4. **Resources tab** → find `AWS::CloudFormation::Stack` type entries with `CREATE_FAILED`
+5. **Click Physical ID** of the failed nested stack
+6. **Repeat** until reaching a stack with only leaf resources (no further `AWS::CloudFormation::Stack`)
+7. **Read Status Reason** on the failed leaf resource — this is the root cause
+
+### Tip: Find Root Cause via CLI
+
+```bash
+# List failed events for a stack (pass its name or ID):
+aws cloudformation describe-stack-events \
+  --stack-name <stack-name> \
+  --region <region> \
+  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].{Time:Timestamp,Resource:LogicalResourceId,Type:ResourceType,Reason:ResourceStatusReason}' \
+  --output table
+
+# For nested stacks — get the nested stack's name from the Resources tab:
+aws cloudformation describe-stack-events \
+  --stack-name <nested-stack-name> \
+  --region <region> \
+  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'
+```
+
+---
+
+## Resource Error Catalog
+
+### AWS::SageMaker::Cluster
+
+| Status Reason | Root Cause | Fix |
+| ----------------------------------------------------------- | -------------------------------------------------- | ---------------------------------------------------------------------------- |
+| `Insufficient capacity in the Availability Zone` | No on-demand instances available in AZ | Add subnet in different AZ; use Flexible Training Plans or reserved capacity |
+| `No subnets in the capacity AZ` | Cluster subnet not in the AZ where capacity exists | Create subnet in the AZ where instances are available |
+| `EFA health checks did not run successfully` | Security group missing self-referencing rules | Add inbound + outbound self-ref rules on SG (protocol: All, source: self) |
+| `Lifecycle scripts did not run successfully` | Script error, S3 access, or timeout | Check CloudWatch logs: `/aws/sagemaker/Clusters/<cluster-name>/<cluster-id>` |
+| `Instance bootstrap failed due to network misconfiguration` | VPC routing or SG issue | Verify NAT Gateway route, S3 VPC endpoint, SG rules |
+| `The security group 'sg-xxx' does not exist` | SG ID is wrong or in different region | Verify SG exists in the same region and VPC |
+| `The subnet 'subnet-xxx' does not exist` | Subnet ID is wrong or in different region | Verify subnet exists in the same region |
+| `You are not authorized to perform this operation` | Execution role missing permissions | Add `AmazonSageMakerClusterInstanceRolePolicy` + VPC permissions |
+| `The maximum number of instances ... has been reached` | Service quota exceeded | Request quota increase via Service Quotas console |
+
+### AWS::IAM::Role
+
+| Status Reason | Root Cause | Fix |
+| ----------------------------------------- | ------------------------------------- | ---------------------------------------------------------- |
+| `Cannot exceed quota for PoliciesPerRole` | Too many managed policies attached | Consolidate inline policies; limit is 10 managed per role |
+| `Invalid principal in policy` | Trust policy references wrong service | Use `"Service": "sagemaker.amazonaws.com"` in trust policy |
+| `MalformedPolicyDocument` | JSON syntax error in inline policy | Validate JSON; check for trailing commas, missing quotes |
+| `EntityAlreadyExists` | Role name already taken | Use unique name or import existing role |
+
+### AWS::EC2::VPC / Subnet / SecurityGroup
+
+| Status Reason | Root Cause | Fix |
+| ---------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------------ |
+| `The CIDR 'x.x.x.x/y' conflicts with another subnet` | Overlapping CIDR in same VPC | Use non-overlapping CIDR blocks |
+| `The maximum number of VPCs has been reached` | VPC quota per region (default: 5) | Request VPC quota increase |
+| `InvalidGroup.Duplicate` | SG rule already exists | Skip — not a real error (idempotency issue in template) |
+| `RulesPerSecurityGroupLimitExceeded` | More than 60 inbound or 60 outbound rules | Consolidate rules; use CIDR ranges instead of individual IPs |
+
+### AWS::FSx::FileSystem
+
+| Status Reason | Root Cause | Fix |
+| ----------------------------------------------- | --------------------------------------- | ---------------------------------------------------- |
+| `The subnet is not in a supported AZ` | FSx Lustre not available in subnet's AZ | Use a subnet in an AZ that supports FSx Lustre |
+| `The security group does not belong to the VPC` | SG and subnet in different VPCs | Move SG or subnet to same VPC |
+| `Insufficient storage capacity` | FSx Lustre capacity exhausted in AZ | Try different AZ or reduce storage size |
+| `Invalid deployment type for storage type` | Template uses incompatible FSx config | PERSISTENT_2 requires SSD; check template parameters |
+
+### AWS::Lambda::Function (Custom Resources)
+
+| Status Reason | Root Cause | Fix |
+| ------------------------------------------------ | ------------------------------------ | --------------------------------------------------------- |
+| `<opaque message>` (Custom::Resource) | Lambda-backed custom resource failed | Find the Lambda function name → check its CloudWatch logs |
+| `Timed out` | Lambda exceeded 15-minute limit | Custom resource handler is too slow; check what it does |
+
+**To debug Custom::Resource failures:**
+
+```bash
+# Find the Lambda function name in the CFN Resources tab, then:
+aws logs tail /aws/lambda/<function-name> --region <region> --since 1h
+```
+
+---
+
+## Rolled-Back Stacks
+
+When a stack rolls back, CloudFormation deletes the resources it created. To view rolled-back stacks:
+
+1. CloudFormation Console → **Deleted** filter (top-right dropdown)
+2. Or via CLI:
+
+   ```bash
+   aws cloudformation list-stacks \
+     --stack-status-filter ROLLBACK_COMPLETE DELETE_COMPLETE \
+     --region <region> \
+     --query 'StackSummaries[?contains(StackName,`HyperPod`) || contains(StackName,`hyperpod`)].{Name:StackName,Status:StackStatus,Time:CreationTime}' \
+     --output table
+   ```
+
+---
+
+## CFN Template Gotchas
+
+### ThreadsPerCore
+
+`ThreadsPerCore` defaults to 1 (hyperthreading disabled) when set via the console "Advanced Configuration." This makes p5.48xlarge show 96 vCPU instead of 192. Fix: set `ThreadsPerCore: 2` explicitly.
+
+Any `UpdateCluster` call via CFN **must include ThreadsPerCore** even if not originally set — omitting it resets to the default.
+
+### S3 Bucket Naming
+
+The `SourceS3Uri` must match the pattern `s3://sagemaker-*` per API validation. CFN templates typically create a bucket named `sagemaker-lifecycle-<unique-suffix>`.
+
+### Condition-Dependent Resources
+
+If using the reference HyperPod CFN template, some resources are conditional:
+
+- FSx OpenZFS: only created if `CreateOpenZFS=true`
+- S3 VPC Endpoint: only created if `CreateS3Endpoint=true`
+- SSM Session Document: only if `CreateSSMSessionDocument=true`
+
+A condition evaluating to `false` means the resource is skipped (not failed).
diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cluster-diagnostics-detail.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cluster-diagnostics-detail.md
new file mode 100644
index 00000000..1c613dd4
--- /dev/null
+++ b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cluster-diagnostics-detail.md
@@ -0,0 +1,664 @@
+# Cluster Diagnostics — Detailed Procedures
+
+This file contains the full diagnostic and fix procedures for each section referenced
+in the main [SKILL.md](../SKILL.md). Jump to a section using the anchors below.
+
+---
+
+## A: EFA Health Checks
+
+**Signals:** `"EFA health checks did not run successfully. Ensure that your VPC and security groups are properly configured before attempting to create a new cluster."`
+
+**Root cause:** The security group is missing a self-referencing rule that allows nodes to communicate with each other via EFA. This is the most common cluster creation failure.
+
+### Diagnose
+
+```bash
+# The diagnostic script auto-checks SG rules. You can also run directly:
+bash scripts/diagnose-cluster.sh --cluster <cluster-name> --region <region>
+
+# Or check a specific security group:
+SG=$(aws sagemaker describe-cluster --cluster-name <cluster-name> --region <region> \
+  --query 'VpcConfig.SecurityGroupIds[0]' --output text)
+
+aws ec2 describe-security-groups --group-ids $SG --region <region> \
+  --query 'SecurityGroups[0].{Inbound:IpPermissions,Outbound:IpPermissionsEgress}' \
+  --output json
+```
+
+Look for: self-referencing rules where the Source/Destination is the security group itself.
+
+### Fix
+
+Add the required rules to **every** security group used by the cluster:
+
+```bash
+SG=<sg-id>
+REGION=<region>
+
+# Rule 1 — Inbound self-reference (required for inter-node communication)
+aws ec2 authorize-security-group-ingress --group-id $SG --region $REGION \
+  --ip-permissions '[{"IpProtocol":"-1","UserIdGroupPairs":[{"GroupId":"'"$SG"'"}]}]'
+
+# Rule 2 — Outbound self-reference (required for EFA RDMA traffic)
+aws ec2 authorize-security-group-egress --group-id $SG --region $REGION \
+  --ip-permissions '[{"IpProtocol":"-1","UserIdGroupPairs":[{"GroupId":"'"$SG"'"}]}]'
+
+# Rule 3 — Outbound internet (required for AWS API calls, package downloads)
+aws ec2 authorize-security-group-egress --group-id $SG --region $REGION \
+  --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'
+```
+
+After fixing: verify with `describe-security-groups`, ensure all nodes use the same SG, then **retry cluster creation**. See [cluster-operations.md](cluster-operations.md) § 1 for multi-SG clusters and verification details.
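Eyeballing the `describe-security-groups` JSON for a self-referencing rule is error-prone. A hedged sketch of a local check (the `check_self_ref` helper name is illustrative; it only parses JSON, so it can be fed the saved output of the command above):

```shell
# Pipe `aws ec2 describe-security-groups --group-ids <sg-id> --output json`
# into this function; it reports whether any inbound or outbound rule
# references the security group itself.
check_self_ref() {  # usage: ... | check_self_ref <sg-id>
  python3 -c '
import json, sys
sg_id = sys.argv[1]
sg = json.load(sys.stdin)["SecurityGroups"][0]
perms = sg.get("IpPermissions", []) + sg.get("IpPermissionsEgress", [])
pairs = [p for perm in perms for p in perm.get("UserIdGroupPairs", [])]
print("self-ref OK" if any(p.get("GroupId") == sg_id for p in pairs) else "self-ref MISSING")
' "$1"
}
```

`self-ref MISSING` on either direction means the Rule 1 / Rule 2 fixes above are needed before retrying cluster creation.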
+
+---
+
+## B: Capacity & AZ
+
+**Signals:** `"Insufficient capacity"`, `"We currently do not have sufficient capacity in the Availability Zone you requested"`, `"Cannot provision requested instances"`, `"No subnets in the capacity AZ"`.
+
+### Diagnose
+
+```bash
+# Check which AZs have the instance type
+aws ec2 describe-instance-type-offerings \
+  --location-type availability-zone \
+  --filters "Name=instance-type,Values=<instance-type>" \
+  --region <region> \
+  --query 'InstanceTypeOfferings[*].Location' --output table
+```
+
+### Fix
+
+1. **Try a different AZ** — add a subnet where the instance type is available
+2. **Flexible Training Plans** (recommended for p4d/p5/trn1) — `aws sagemaker search-training-plan-offerings`, then set `TrainingPlanArn` in the cluster config
+3. **Reserved capacity** — contact your AWS account team for large or long-term needs
+
+If using reserved capacity and still failing: verify the subnet AZ matches the reservation AZ. See [cluster-operations.md](cluster-operations.md) § 2 for the condensed workflow and [capacity-planning.md](capacity-planning.md) for the full strategy guide (On-Demand vs. Flexible Training Plans vs. ODCR, AZ-ID selection, subnet IP sizing per instance type, relevant service quotas).
+
+---
+
+## C: Lifecycle Scripts
+
+**Signals:** `"Lifecycle scripts did not run successfully"`, `"Lifecycle scripts execution timed out"`, cluster creation fails during provisioning.
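Several of the lifecycle failures in this section (CRLF line endings, missing shebang, missing exec bit, shell syntax errors) are detectable locally before the scripts ever reach S3. A minimal pre-upload lint, assuming bash scripts — the `lint_lcc_script` helper is a hypothetical convenience, not part of the skill's tooling:

```shell
# Catch the most common lifecycle-script defects before uploading to S3.
lint_lcc_script() {
  local f=$1 rc=0
  # Windows line endings make the interpreter line unparseable on Linux
  if grep -q $'\r' "$f"; then echo "CRLF line endings: run dos2unix $f"; rc=1; fi
  # Without a shebang the script may run under the wrong shell (or not at all)
  if [ "$(head -c 2 "$f")" != "#!" ]; then echo "missing shebang in $f"; rc=1; fi
  # The exec bit must be set before the S3 upload
  if [ ! -x "$f" ]; then echo "not executable: chmod +x $f"; rc=1; fi
  # Parse-only check for shell syntax errors
  bash -n "$f" || { echo "syntax error in $f"; rc=1; }
  return $rc
}
```

Run it over every script referenced by `on_create.sh` before syncing the lifecycle directory to S3; a non-zero exit means at least one check failed.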
+
+### Diagnose
+
+```bash
+# Get the cluster ID for the CloudWatch log group
+CLUSTER_NAME=<cluster-name>
+CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $CLUSTER_NAME --region <region> \
+  --query 'ClusterArn' --output text)
+CLUSTER_ID=$(echo "$CLUSTER_ARN" | cut -d/ -f2)
+LOG_GROUP="/aws/sagemaker/Clusters/${CLUSTER_NAME}/${CLUSTER_ID}"
+
+# List lifecycle log streams
+aws logs describe-log-streams \
+  --log-group-name "$LOG_GROUP" \
+  --region <region> \
+  --query 'logStreams[?starts_with(logStreamName,`LifecycleConfig`)].logStreamName' \
+  --output table
+
+# Read a specific log stream
+aws logs get-log-events \
+  --log-group-name "$LOG_GROUP" \
+  --log-stream-name "LifecycleConfig/<instance-group>/<instance-id>" \
+  --region <region> \
+  --query 'events[*].message' --output text
+```
+
+### Common Errors & Fixes
+
+| Log Error | Root Cause | Fix |
+| ---------------------------------------- | ------------------------------ | ----------------------------------------------------------------------------------------- |
+| `Connect timeout on endpoint URL: s3://` | No S3 access from VPC | Add S3 Gateway VPC endpoint to the subnet route table |
+| `AccessDenied` on S3 | Missing IAM permissions | Add `s3:GetObject` + `s3:ListBucket` to the execution role for the lifecycle script S3 bucket |
+| Script never exits / timeout | Infinite loop or hung command | Add proper exit codes; test the script locally; add `set -e` to fail fast |
+| `ASCII text, with CRLF line terminators` | Windows line endings | Convert: `dos2unix script.sh` before uploading to S3 |
+| `provisioning_parameters.json mismatch` | Instance group name mismatch | Match instance group names exactly between the lifecycle script and the API call |
+| `command not found` | Missing dependency | Check if required packages are in the AMI; install in the script |
+| `Permission denied` | Missing shebang or permissions | Add `#!/bin/bash` as the first line; ensure `chmod +x` before S3 upload |
+
+Compare scripts with the latest upstream versions — see [cluster-operations.md](cluster-operations.md) § 3 for repo links, testing tips, and execution order. For the full S3 layout, `config.py` toggle reference, per-node-type detection flow, and on-node debug procedures (`/var/log/provision/`, `resource_config.json`), see [lifecycle-scripts.md](lifecycle-scripts.md).
+
+---
+
+## D: EKS Access / kubectl
+
+**Signals:** `"couldn't get current server API group list: the server has asked for the client to provide credentials"`, kubectl auth errors, `kubectl get nodes` returns nothing or errors.
+
+**Root cause:** IAM identity not configured in EKS access entries, or kubeconfig not set up.
+
+### Diagnose
+
+```bash
+# Step 1: Check your IAM identity
+aws sts get-caller-identity
+
+# Step 2: Get the EKS cluster name from HyperPod
+EKS_ARN=$(aws sagemaker describe-cluster --cluster-name <cluster-name> --region <region> \
+  --query 'Orchestrator.Eks.ClusterArn' --output text)
+EKS_NAME=$(echo $EKS_ARN | awk -F'/' '{print $NF}')
+echo "EKS cluster: $EKS_NAME"
+
+# Step 3: Check existing access entries
+aws eks list-access-entries --cluster-name $EKS_NAME --region <region>
+
+# Step 4: Check auth mode
+aws eks describe-cluster --name $EKS_NAME --region <region> \
+  --query 'cluster.accessConfig.authenticationMode' --output text
+# Must be API or API_AND_CONFIG_MAP — not CONFIG_MAP
+```
+
+### Fix
+
+```bash
+# Step 1: Add your IAM identity to EKS access entries
+MY_ARN=$(aws sts get-caller-identity --query 'Arn' --output text)
+
+# For IAM users:
+aws eks create-access-entry \
+  --cluster-name $EKS_NAME \
+  --region <region> \
+  --principal-arn $MY_ARN
+
+# Step 2: Associate admin policy (for full cluster access)
+aws eks associate-access-policy \
+  --cluster-name $EKS_NAME \
+  --region <region> \
+  --principal-arn $MY_ARN \
+  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
+  --access-scope '{"type": "cluster"}'
+
+# Step 3: Update kubeconfig
+aws eks update-kubeconfig --name $EKS_NAME --region <region>
+
+# Step 4: Test access
+kubectl get nodes
+kubectl get pods -A
+```
+
+**If auth mode is CONFIG_MAP** (not supported by HyperPod): change to `API_AND_CONFIG_MAP` via `aws eks update-cluster-config --name $EKS_NAME --access-config authenticationMode=API_AND_CONFIG_MAP`. For IAM roles, use the role ARN (not the session ARN). See [cluster-operations.md](cluster-operations.md) § 4 for auth mode details.
+
+---
+
+## E: Cluster Provisioning
+
+**Signals:** Cluster shows `InService` but instances are not visible, `kubectl get nodes` returns no nodes, `list-cluster-nodes` shows fewer nodes than expected.
+
+**Root cause:** This is often expected behavior with **Continuous Provisioning mode** (EKS only). In this mode the cluster transitions to InService before all instances are created. Instance creation happens asynchronously, and failures are reported via cluster events, not as cluster creation failures.
+
+### Diagnose
+
+```bash
+# Step 1: Check cluster status and provisioning mode
+aws sagemaker describe-cluster --cluster-name <cluster-name> --region <region> \
+  --query '{Status:ClusterStatus,Groups:InstanceGroups[*].{Name:InstanceGroupName,Count:CurrentCount,Target:InstanceCount,Status:InstanceGroupStatus}}' \
+  --output table
+
+# Step 2: Check cluster events (EKS — primary source of truth)
+aws sagemaker list-cluster-events --cluster-name <cluster-name> --region <region> \
+  --query 'ClusterEventSummaries[*].{Time:EventTime,Type:EventType,Message:Message}' \
+  --output table
+
+# Step 3: Check individual node status
+aws sagemaker list-cluster-nodes --cluster-name <cluster-name> --region <region> \
+  --query 'ClusterNodeSummaries[*].{ID:InstanceId,Group:InstanceGroupName,Status:InstanceStatus.Status}' \
+  --output table
+```
+
+### Common Scenarios
+
+| Observation | Cause | Action |
+| --------------------------------------------------------- | ----------------------------------------- | ---------------------------------------------------- |
+| CurrentCount < InstanceCount, events show provisioning | Continuous provisioning — still creating | Wait; monitor events |
+| Events show `"Insufficient capacity"` | No capacity in AZ | See **[B: Capacity & AZ](#b-capacity--az)** |
+| Events show lifecycle script failure | Script error during instance provisioning | See **[C: Lifecycle Scripts](#c-lifecycle-scripts)** |
+| Events show `"EFA health checks"` | SG misconfiguration | See **[A: EFA Health Checks](#a-efa-health-checks)** |
+| No events, no nodes | Cluster may be stuck | Check CloudFormation stack; contact Support |
+| Nodes in `list-cluster-nodes` but not `kubectl get nodes` | EKS registration issue | Check lifecycle script logs, kubelet status via SSM |
+
+See [cluster-operations.md](cluster-operations.md) § 5 for Continuous Provisioning details (EKS only).
+
+---
+
+## F: SSM Connectivity
+
+**Signals:** `"Target is not connected"`, SSM session fails to start, cannot access nodes.
+
+### Diagnose
+
+```bash
+# Step 1: Verify the SSM plugin is installed
+session-manager-plugin --version
+
+# Step 2: Get the correct target format
+# Target format: sagemaker-cluster:<cluster-id>_<instance-group>-<instance-id>
+# Do NOT use the EC2 instance ID directly!
+
+CLUSTER_INFO=$(aws sagemaker describe-cluster --cluster-name <cluster-name> --region <region>)
+CLUSTER_ID=$(echo "$CLUSTER_INFO" | python3 -c "import sys,json; print(json.load(sys.stdin)['ClusterArn'].split('/')[-1])")
+
+aws sagemaker list-cluster-nodes --cluster-name <cluster-name> --region <region> \
+  --query 'ClusterNodeSummaries[*].{ID:InstanceId,Group:InstanceGroupName,Status:InstanceStatus.Status}' \
+  --output table
+
+# Step 3: Construct the target and test
+TARGET="sagemaker-cluster:${CLUSTER_ID}_<instance-group>-<instance-id>"
+aws ssm start-session --target "$TARGET" --region <region>
+```
+
+### Required IAM Permissions for SSM
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [{
+    "Effect": "Allow",
+    "Action": [
+      "sagemaker:DescribeCluster",
+      "sagemaker:ListClusterNodes",
+      "ssm:StartSession",
+      "ssm:TerminateSession"
+    ],
+    "Resource": "*"
+  }]
+}
+```
+
+### Common Errors & Fixes
+
+| Error | Root Cause | Fix |
+| --------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------- |
+| `SessionManagerPlugin is not found` | SSM plugin not installed | Install: `brew install --cask session-manager-plugin` (macOS) or download from AWS docs (Linux). Verify: `session-manager-plugin --version` |
+| `Target is not connected` | Wrong target format, wrong region, or node not running | Use the `sagemaker-cluster:` prefix (CLUSTER_ID is the ARN suffix, not the cluster name); verify the region; check the node is `Running` |
+| `InvalidTarget` / `ValidationException` | Malformed target string | Format must be `sagemaker-cluster:<cluster-id>_<instance-group>-<instance-id>` exactly |
+| `Access denied` | Missing IAM permissions | Need `ssm:StartSession`, `sagemaker:DescribeCluster`, `sagemaker:ListClusterNodes` — see the IAM policy above |
+| Connection timeout | SSM agent unreachable | Check VPC endpoints (SSM, SSMMessages, EC2Messages) exist in the cluster VPC; verify the node is `Running` |
+
+SSM access is **identical for both EKS and Slurm** clusters — same target format, same plugin, same IAM permissions, same VPC endpoints.
+
+For SSH-over-SSM setup, see [cluster-operations.md](cluster-operations.md) § 6.
+
+---
+
+## G: Node Replacement
+
+**Signals:** Auto-replacement not triggering, `batch-replace-cluster-nodes` not working, node stuck in unhealthy state.
+
+### G.1: Auto-Replacement Not Working
+
+```bash
+# Step 1: Check if NodeRecovery is enabled per instance group
+aws sagemaker describe-cluster --cluster-name <cluster-name> --region <region> \
+  --query 'InstanceGroups[*].{Group:InstanceGroupName,Recovery:NodeRecovery}' --output table
+
+# Step 1a: If NodeRecovery=None, enable it with update-cluster. All required fields
+# for each instance group must be supplied (InstanceType/Count/LifeCycleConfig/ExecutionRole) —
+# derive them from describe-cluster output first.
+aws sagemaker update-cluster --cluster-name <cluster-name> --region <region> \
+  --instance-groups '[{"InstanceGroupName":"<group-name>","InstanceType":"ml.p5.48xlarge",
+    "InstanceCount":<count>,"ThreadsPerCore":2,
+    "LifeCycleConfig":{"SourceS3Uri":"<s3-uri>","OnCreate":"