feat: Add ai-optimization skill for SageMaker AI Optimization APIs #147
@@ -0,0 +1,111 @@

---
name: ai-optimization
description: Guides users through SageMaker AI Optimization APIs for benchmarking and optimizing LLM inference. Covers workload configuration, benchmark jobs, and recommendation jobs that find the best instance type, optimization strategy, and serving configuration for a model. Use when the user says "benchmark my model", "optimize inference", "find the best instance", "recommendation job", "workload config", "AI benchmark", "AI recommendation", "reduce inference cost", "improve latency", or "optimize throughput".
metadata:
  version: "1.0.0"
---

# AI Optimization

Guide users through SageMaker AI Optimization APIs to benchmark LLM inference performance and get deployment recommendations.

## Scope

This skill covers the **SageMaker AI Optimization** APIs, which help users:

- **Benchmark** an existing SageMaker endpoint to measure inference performance (latency, throughput, cost)
- **Get recommendations** for the best instance type, serving configuration, and optional optimizations (kernel tuning, speculative decoding) for deploying a model
### Three Resource Types

| Resource | Purpose |
| --- | --- |
| **AIWorkloadConfig** | Defines the traffic pattern (request shape, concurrency, dataset) for benchmarking |
| **AIBenchmarkJob** | Runs a benchmark against a live SageMaker endpoint using a workload config |
| **AIRecommendationJob** | Analyzes a model, deploys it on candidate instances, benchmarks each, and returns ranked recommendations |

### 14 API Operations

| Resource | Create | Describe | Delete | List | Stop |
| --- | --- | --- | --- | --- | --- |
| AIWorkloadConfig | ✓ | ✓ | ✓ | ✓ | |
| AIBenchmarkJob | ✓ | ✓ | ✓ | ✓ | ✓ |
| AIRecommendationJob | ✓ | ✓ | ✓ | ✓ | ✓ |
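In boto3, these operations surface as snake_case methods on the SageMaker client. The `create_*` and `describe_*` calls below match the ones used later in this skill; the list, stop, and delete names are extrapolated from the same convention, so treat them as assumptions to verify against the SDK:

```python
import boto3

sm = boto3.client("sagemaker")

# Benchmark-job lifecycle. create_ai_benchmark_job and
# describe_ai_benchmark_job appear later in this skill; the
# list/stop/delete names below are assumed from the table above.
sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
sm.list_ai_benchmark_jobs()
sm.stop_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
sm.delete_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
```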
## Principles

1. **One thing at a time.** Each response advances exactly one decision.
2. **Confirm before proceeding.** Wait for the user to agree before moving to the next step.
3. **Don't read files until you need them.** Only read reference files when you've reached the step that requires them.
4. **Use what you know.** If the answer is in conversation history or any file you've already read, use it.
5. **No narration.** Share outcomes and ask questions. Keep responses short.
6. **Notebook writing.** Write notebooks using your standard file write tool to create the `.ipynb` file with the complete notebook JSON, OR use notebook MCP tools if available. Do NOT use bash commands to generate notebooks. A minimal skeleton is sketched below.
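For principle 6, here is a minimal, valid nbformat v4 notebook written with a plain file write (the filename and cell contents are placeholders):

```python
import json

# Minimal nbformat v4 notebook containing a single code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')\n"],
        }
    ],
}

# A single file write yields a notebook that Jupyter can open directly.
with open("example.ipynb", "w") as f:
    json.dump(notebook, f, indent=1)
```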
## Workflow

### Step 1: Determine the User's Goal

Check conversation history first. The user typically wants one of:

1. **Benchmark an existing endpoint** — They already have a deployed model and want performance metrics.
2. **Get deployment recommendations** — They have a model in S3 and want to know the best instance type and configuration.
3. **Both** — Benchmark first, then optimize.

If unclear, ask:

> "What would you like to do?
>
> 1. **Benchmark** — Measure performance of an existing SageMaker endpoint
> 2. **Get recommendations** — Find the best instance type and configuration for a model in S3
>
> Pick one, or describe what you're trying to achieve."

⏸ Wait for user.

- If benchmark → go to Step 2A.
- If recommendations → go to Step 2B.
### Step 2A: Benchmark an Existing Endpoint

Read `references/benchmark-workflow.md` and follow its instructions.

### Step 2B: Get Deployment Recommendations

Read `references/recommendation-workflow.md` and follow its instructions.

### Step 3: Review Results

After the job completes:

- For **benchmark jobs**: present the performance metrics (latency percentiles, throughput, cost estimates).
- For **recommendation jobs**: present the ranked recommendations with instance type, expected performance, and optimization details.

Read `references/interpreting-results.md` for guidance on presenting results to the user.
### Step 4: Next Steps

After presenting results, offer relevant next steps:

> "What would you like to do next?
>
> - **Deploy the recommended configuration** — I can help create a SageMaker endpoint using the top recommendation
> - **Run another benchmark** — Test with different parameters or a different workload
> - **Compare results** — Run recommendations with different performance targets (cost vs. latency vs. throughput)"

## Prerequisites

- **AWS credentials** configured (via AWS CLI, environment variables, or SageMaker Space)
- **IAM role** with SageMaker permissions (`AmazonSageMakerFullAccess` or equivalent)
- For benchmarking: a deployed SageMaker endpoint
- For recommendations: a model stored in S3 (HuggingFace format)
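A minimal preflight sketch to confirm credentials resolve before creating any jobs (standard STS and session calls; no AI Optimization APIs involved):

```python
import boto3

# Confirm which identity and region the jobs will run under.
identity = boto3.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Caller: ", identity["Arn"])
print("Region: ", boto3.session.Session().region_name)
```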
## Troubleshooting

### Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| Job stuck in Pending | No available capacity for the requested instance type | Try a different instance type or wait for capacity |
| Job failed with "ResourceLimitExceeded" | Account quota exceeded | Request a quota increase for the instance type |
| Benchmark metrics look wrong | Workload config doesn't match the model's capabilities | Adjust token counts and concurrency in the workload config |
| Recommendation job failed | Model format not supported or S3 path incorrect | Verify the model is in HuggingFace format and the S3 URI is correct |
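Quota failures can also be caught programmatically with boto3's standard `ClientError` handling; the error-code string here is an assumption based on the table above:

```python
from botocore.exceptions import ClientError

try:
    sm.create_ai_benchmark_job(
        AIBenchmarkJobName="my-benchmark-job",
        AIWorkloadConfigIdentifier="my-workload-config",
        RoleArn="<ROLE_ARN>",
        BenchmarkTarget={"Endpoint": {"Identifier": "<ENDPOINT_NAME>"}},
        OutputConfig={"S3OutputLocation": "s3://<BUCKET>/benchmark-results/"},
    )
except ClientError as e:
    # Assumed error code; see the troubleshooting table above.
    if e.response["Error"]["Code"] == "ResourceLimitExceeded":
        print("Quota exceeded: request a quota increase for the instance type.")
    else:
        raise
```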
@@ -0,0 +1,89 @@

# Benchmark Results Download

Generate a notebook cell that downloads and displays benchmark results. The output is stored as an `output.tar.gz` archive — the primary metrics file is `profile_export_aiperf.json`.
```python
import io
import json
import tarfile
from urllib.parse import urlparse

import boto3

# sm client is defined in a prior cell (Step 3)
# Sketch of the download: response keys are assumptions based on the
# OutputConfig passed to create_ai_benchmark_job.
job = sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
uri = urlparse(job["OutputConfig"]["S3OutputLocation"].rstrip("/") + "/output.tar.gz")

# Fetch the archive and display the primary metrics file.
obj = boto3.client("s3").get_object(Bucket=uri.netloc, Key=uri.path.lstrip("/"))
with tarfile.open(fileobj=io.BytesIO(obj["Body"].read()), mode="r:gz") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith("profile_export_aiperf.json"))
    print(json.dumps(json.load(tar.extractfile(member)), indent=2))
```
**Suggested change:**

```diff
-# sm client is defined in a prior cell (Step 3)
+# sm client is defined in a prior cell (Step 2)
```
@@ -0,0 +1,107 @@

# Benchmark Workflow

Guide the user through creating and running an AI Benchmark Job.

## Step 1: Gather Endpoint Information
You need:

- **Endpoint name** — The SageMaker endpoint to benchmark
- **Inference components** (optional) — If the endpoint uses inference components, which ones to target

If not already known, ask:

> "What's the name of the SageMaker endpoint you want to benchmark? If it uses inference components, let me know which ones to target."

Use the AWS MCP tool `describe-endpoint` to verify the endpoint exists and is InService. If the user specified inference components, also use `describe-inference-component` to verify they exist on the endpoint.
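If the MCP tool isn't available, the same checks can be done directly with boto3 (these are standard SageMaker SDK calls; the names in angle brackets are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")

# Fail fast if the endpoint is missing or not ready to serve traffic.
endpoint = sm.describe_endpoint(EndpointName="<ENDPOINT_NAME>")
assert endpoint["EndpointStatus"] == "InService", endpoint["EndpointStatus"]

# If the user named inference components, confirm each one exists.
component = sm.describe_inference_component(InferenceComponentName="<COMPONENT_NAME>")
print(component["InferenceComponentStatus"])
```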
## Step 2: Create a Workload Config

A workload config defines the traffic pattern. Key parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| `prompt_input_tokens_mean` | Average input token count | 512 |
| `prompt_input_tokens_stddev` | Std dev of input tokens | 50 |
| `output_tokens_mean` | Average output token count | 256 |
| `output_tokens_stddev` | Std dev of output tokens | 30 |
| `concurrency` | Concurrent requests | 1 |
| `request_count` | Total requests to send | 100 |

Ask the user:

> "What's the typical input/output length (in tokens) and concurrency? Or I can use sensible defaults."

⏸ Wait for user.

Generate a notebook cell that creates the workload config:
```python
import json

import boto3

sm = boto3.client("sagemaker")

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "prompt_input_tokens_mean": 512,  # Adjust based on user input
        "prompt_input_tokens_stddev": 50,
        "output_tokens_mean": 256,  # Adjust based on user input
        "output_tokens_stddev": 30,
        "concurrency": 1,  # Adjust based on user input
        "request_count": 100,
    },
}

sm.create_ai_workload_config(
    AIWorkloadConfigName="my-workload-config",
    AIWorkloadConfigs={"WorkloadSpec": {"Inline": json.dumps(workload_spec)}},
)
```
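Before moving on, the config can be confirmed with a describe call. The operation name is extrapolated from the API table in SKILL.md, so verify it against the SDK:

```python
# Assumed describe operation mirroring the create call above.
config = sm.describe_ai_workload_config(AIWorkloadConfigName="my-workload-config")
print(config)
```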
## Step 3: Create the Benchmark Job

Generate a notebook cell that creates and monitors the benchmark job:
```python
import time

sm.create_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job",
    AIWorkloadConfigIdentifier="my-workload-config",
    RoleArn="<ROLE_ARN>",  # User's IAM role
    BenchmarkTarget={
        "Endpoint": {"Identifier": "<ENDPOINT_NAME>"}
    },
    OutputConfig={
        "S3OutputLocation": "s3://<BUCKET>/benchmark-results/"
    },
)

# Poll until complete (timeout after 1 hour)
MAX_WAIT = 3600
start = time.time()
while time.time() - start < MAX_WAIT:
    resp = sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
    status = resp["AIBenchmarkJobStatus"]
    print(f"Status: {status} ({int(time.time() - start)}s elapsed)")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)
else:
    # while-else runs only if the loop exits without break (i.e., it timed out)
    raise TimeoutError("Benchmark job did not complete within 1 hour")

if status == "Failed":
    print(f"Benchmark failed: {resp.get('FailureReason', 'Unknown')}")
elif status == "Stopped":
    print("Benchmark was stopped before completion.")
else:
    print("Benchmark completed successfully.")
```
## Step 4: Present Results

When the job completes, read `benchmark-results.md` for the code to download and display results.

Return to the main SKILL.md Step 3 (Review Results).
> **Review comment:** Since this PR adds a new skill (ai-optimization), it's a new feature for the sagemaker-ai plugin, but the plugin manifest versions remain 1.1.0. Per docs/MAINTAINERS_GUIDE.md:60, please bump the plugin version (in both `.claude-plugin/plugin.json` and `.codex-plugin/plugin.json`) accordingly.