| name | gpu-kernel-optimizer |
|---|---|
| description | End-to-end GPU kernel implementation and optimization router. Use this skill to turn PyTorch logic into a high-performance kernel or to systematically optimize an existing kernel. It initializes the local knowledge base, identifies the current phase, and routes work to baseline implementation, bottleneck analysis, and profile-driven optimization sub-skills. |
This skill owns the global constraints and selects the right sub-skill for each phase.
Any target-hardware spec value, including compute throughput, HBM bandwidth, cache/LDS/SMEM capacity and bandwidth, register counts, SM/CU count, warp or wavefront size, issue rate, occupancy limits, or any other microarchitecture limit, must be sourced from the local <gpu-wiki>/ knowledge base before use. Unsourced values invalidate the run and must be discarded.
- No fabrication: Do not use memory, estimates, model output, web snippets, or verbal user statements as hardware specs in `memory/v<N>.json`, `baseline_plan.md`, `baseline_report.md`, `plans/v<N>_plan.md`, Roofline analysis, target calculations, or any archived decision.
- Source every spec: Record every cited spec in this exact style:
  `<metric>: <value> <unit> <- <gpu-wiki>/<relative-path>:<line-or-section>`
  Example: `H20 HBM3 bandwidth: 4.0 TB/s <- <gpu-wiki>/docs/hardware/h20_spec.md:L23`.
- Ownership: Authoritative hardware lookup, registration, and downstream decisions must happen in this workflow. Values not registered in the workspace `README.md` or current `plans/v<N>_plan.md` cannot be used as specs, performance targets, or Roofline inputs.
- Fixed lookup flow:
  - Read `<gpu-wiki>/README.md` first to find the hardware-spec index.
  - Search target-platform spec documents under `<gpu-wiki>/docs/`, `<gpu-wiki>/ref-docs/`, and other indexed directories.
  - Register sources in archived files using the required format.
  - Use the value only after registration.
- Missing wiki values: If a spec cannot be found, write `<metric>: UNKNOWN (gpu-wiki not found)`, record the gap in `memory/v<N>.json` under `pitfalls_and_fixes`, report it to the user, and ask whether a placeholder is acceptable. Do not fill gaps with wording such as "approximately", "should be", "usually", or "similar product".
- Auditable archive: Any reviewer must be able to verify every spec from the local `<gpu-wiki>/` path in the archive. Non-verifiable archives are invalid.
- Profiling-driven optimization only:
  - NVIDIA: `ncu`
  - AMD: `./tools/profile_kernel.sh`
  - `triton.testing.do_bench` is the designated tool for end-to-end kernel latency measurement used in Stop Conditions evaluation and performance recording. This is a timing tool, not a profiler. It may determine whether the target is met, but must not replace `ncu` or `profile_kernel.sh` for identifying bottlenecks.
- Step 0 computes the performance targets and writes them into `README.md` under `Stop Conditions`.
- Optimization runs in a temporary workspace and every accepted iteration is committed with git.
- Masked memory: When reading `memory/v*.json` files, check the `masked` field in each file. Skip any file where `masked` is `true`. Masked files are treated as discarded memory and must not influence planning, search deduplication, or optimization decisions. The `masked` field defaults to `false` and can be set to `true` by the `gpu-kernel-partial-restart` sub-skill or by the user manually.
- Baseline performance evaluation against theoretical limits
- Roofline peak FLOPS and peak bandwidth axes
- Memory-bound decisions and target comparisons
- TFLOPS, bandwidth, and absolute target comparisons for every iteration
- Occupancy, register pressure, LDS capacity, and other microarchitecture limits
- Any performance number written into archive files
Before entering any phase or sub-skill that uses hardware specs, confirm:
- Which hardware specs are needed for this task? Have they been sourced from `<gpu-wiki>/` and registered with the required source format?
- Does any archived spec value lack a gpu-wiki source? If yes, remove it or add the source immediately.
If any check fails, complete the missing work before continuing.
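The source check above can be partly automated. The following is a minimal sketch, assuming archived spec lines follow the required `<metric>: <value> <unit> <- <gpu-wiki>/...` style literally, including the `<gpu-wiki>` placeholder. The names `SPEC_LINE`, `UNKNOWN_LINE`, `classify_spec_line`, and `find_unsourced` are hypothetical and not part of the shipped tools.

```python
import re

# Hypothetical pattern for the required style:
# <metric>: <value> <unit> <- <gpu-wiki>/<relative-path>:<line-or-section>
SPEC_LINE = re.compile(
    r"^(?P<metric>[^:]+):\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>\S+)\s+<-\s+"
    r"<gpu-wiki>/(?P<path>[^:\s]+):(?P<locator>\S+)$"
)
# Pattern for the documented gap marker.
UNKNOWN_LINE = re.compile(r"^(?P<metric>[^:]+):\s+UNKNOWN \(gpu-wiki not found\)$")


def classify_spec_line(line: str) -> str:
    """Return 'sourced', 'unknown', or 'unsourced' for one archived spec line."""
    text = line.strip()
    if SPEC_LINE.match(text):
        return "sourced"
    if UNKNOWN_LINE.match(text):
        return "unknown"
    return "unsourced"


def find_unsourced(lines):
    """Collect spec lines with no gpu-wiki source; these must be removed or fixed."""
    return [ln for ln in lines if classify_spec_line(ln) == "unsourced"]
```

A line flagged as `unsourced` would need either a registered source or removal before any phase uses it. The sketch only checks the format; it does not confirm that the cited wiki path exists.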
Trigger this skill when the request asks to:
- Implement PyTorch logic as a high-performance GPU kernel
- Optimize an existing kernel
- Analyze kernel bottlenecks, Roofline behavior, or bandwidth utilization
- Move from baseline implementation to profile-driven optimization
Create an isolated optimization workspace under /tmp using the initialization script:
`bash reference/workspace_init.sh <name> <kernel_demo_path>`

This script creates the workspace directory structure (`memory/`, `plans/`, `profiles/`), copies the kernel demo as `kernel.py`, initializes git, and creates `.gitignore`. See `reference/workspace_init.sh` for details.
After global constraints are confirmed and before writing the workspace README.md, parse configuration from the user prompt. Do not read the current directory README.md for configuration. All configuration must come from explicit user input or defaults.
Flow: parse user input -> Step 0 (hardware specs + Roofline analysis) -> write workspace README.md.
| Field | Description | Default |
|---|---|---|
| `platform` | Required. Target hardware platform, such as H20, H100, MI308X, or MI355X. | Ask the user if missing. |
| `arch` | Hardware architecture, derived from platform when possible. | H20/H100/H200 -> Hopper; MI300X/MI308X -> CDNA3; MI355X -> CDNA4. |
| `framework` | Required. Programming language/framework, such as CuteDSL or FlyDSL. | Ask the user if missing. |
| `gpu_wiki_path` | Local gpu-wiki path. Do not ask the user to confirm it. | `/tmp/gpu-wiki/` |
| `reference_project` | Local reference-project path. | `/tmp/reference-projects/` |
| `kernel_demo` | Required. Initial kernel implementation file to optimize. | Ask the user if missing. |
| `additional_notes` | Extra constraints, known bottlenecks, preferred directions, and edge cases. | none |
- Extract fields from the user prompt, for example "optimize on H20" -> `platform: H20`, "write it in CuteDSL" -> `framework: CuteDSL`.
- Derive `arch` from `platform` using the mapping above.
- Validate only the `kernel_demo` path.
- If `platform`, `framework`, or `kernel_demo` is missing, ask the user. Do not guess.
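The parsing rules above can be sketched as a small data structure. This is an illustration only: `SessionConfig`, `ARCH_BY_PLATFORM`, `REQUIRED_FIELDS`, and `finalize` are hypothetical names, and the mapping and defaults are copied from the configuration table.

```python
from dataclasses import dataclass
from typing import Optional

# Mapping copied from the configuration table.
ARCH_BY_PLATFORM = {
    "H20": "Hopper", "H100": "Hopper", "H200": "Hopper",
    "MI300X": "CDNA3", "MI308X": "CDNA3",
    "MI355X": "CDNA4",
}
REQUIRED_FIELDS = ("platform", "framework", "kernel_demo")


@dataclass
class SessionConfig:
    platform: Optional[str] = None
    framework: Optional[str] = None
    kernel_demo: Optional[str] = None
    arch: Optional[str] = None
    gpu_wiki_path: str = "/tmp/gpu-wiki/"
    reference_project: str = "/tmp/reference-projects/"
    additional_notes: str = "none"


def finalize(config: SessionConfig) -> list:
    """Derive arch when possible and return the required fields still missing."""
    if config.arch is None and config.platform is not None:
        # Stays None when the platform is not in the table; no guessing.
        config.arch = ARCH_BY_PLATFORM.get(config.platform)
    return [name for name in REQUIRED_FIELDS if getattr(config, name) is None]
```

A non-empty return value corresponds to the "ask the user" case: for example, a prompt that names only the platform and framework would yield `["kernel_demo"]`.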
Run Step 0 immediately after parsing user input and before writing the workspace README.md.
Goal: use the target platform and kernel_demo to source hardware specs from gpu-wiki, perform theoretical Roofline analysis, compute absolute performance targets, and write them into README.md under Stop Conditions.
- Lookup hardware specs from gpu-wiki: Read `<gpu-wiki>/README.md`, follow its indexes, and find exact target-platform specs such as peak TFLOPS, HBM bandwidth, L2/LDS capacity, and relevant bandwidths. Match the exact hardware; do not infer from similar products. Every spec must include a gpu-wiki source and be written into `README.md` under `Hardware Spec`.
- Analyze the kernel demo statically:
  - Compute theoretical FLOPs and theoretical data movement in bytes.
  - Compute `Arithmetic Intensity = FLOPs / Bytes`, compare with the Roofline ridge point, and classify the kernel as compute-bound or memory-bound.
- Compute absolute targets as `hardware peak * 90%`, preferring gpu-wiki measured maxima when available, otherwise using documented hardware specs.
- Write the Roofline analysis to `README.md`, including sourced specs, calculation process, bound classification, and absolute targets such as `compute-bound target >= 185.4 TFLOPS` or `memory-bound target >= 3.87 TB/s`.
- Copy the targets to `README.md` under `Stop Conditions`.
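The Step 0 arithmetic can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: `RooflineResult` and `roofline` are hypothetical names, the peak values are parameters that must come from sourced gpu-wiki specs, and treating a kernel exactly at the ridge point as compute-bound is an assumed convention.

```python
from dataclasses import dataclass


@dataclass
class RooflineResult:
    arithmetic_intensity: float  # FLOPs per byte
    ridge_point: float           # FLOPs per byte
    bound: str                   # "compute-bound" or "memory-bound"
    target: float                # TFLOPS if compute-bound, TB/s if memory-bound


def roofline(flops: float, bytes_moved: float,
             peak_tflops: float, peak_tb_per_s: float,
             target_fraction: float = 0.90) -> RooflineResult:
    """Classify a kernel against the Roofline and derive its absolute target."""
    intensity = flops / bytes_moved
    # Ridge point: the intensity at which the compute and bandwidth roofs meet.
    ridge = (peak_tflops * 1e12) / (peak_tb_per_s * 1e12)
    if intensity >= ridge:
        return RooflineResult(intensity, ridge, "compute-bound",
                              peak_tflops * target_fraction)
    return RooflineResult(intensity, ridge, "memory-bound",
                          peak_tb_per_s * target_fraction)
```

With placeholder peaks of 100 TFLOPS and 2 TB/s (illustrative inputs, not real hardware specs), the ridge point is 50 FLOPs/byte, so a kernel at 0.25 FLOPs/byte would be classified as memory-bound.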
Completion criteria:
- Hardware specs, Roofline analysis, and `Stop Conditions` are written into workspace `README.md`.
- Bound type is determined.
- Absolute targets are computed.
Then proceed directly to writing the workspace README.md.
Before entering any sub-skill, write the initial session constraints into the workspace README.md. The workspace README.md stores static configuration parsed from user input, including Task Context and ISA Optimization Targets (previously in memory.md). memory/ stores structured iteration data as per-version JSON files (memory/v<N>.json). Files with masked: true are skipped during reads. README.md and the unmasked memory/v*.json files are the source of truth for every later iteration.
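The masked-read rule can be illustrated with a short sketch. It assumes each `memory/v*.json` file holds a single JSON object with an optional boolean `masked` field. The function name `load_unmasked_memory` is hypothetical and is not the interface of `tools/memory_manager.py`.

```python
import json
from pathlib import Path


def load_unmasked_memory(workspace: str) -> dict:
    """Return {filename: record} for every memory/v*.json that is not masked."""
    records = {}
    for path in sorted(Path(workspace, "memory").glob("v*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # The masked field defaults to false when absent.
        if record.get("masked", False) is True:
            continue
        records[path.name] = record
    return records
```

Only the returned records would feed planning, search deduplication, and optimization decisions; masked files never reach those steps.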
Fill README.md using ./reference/README.md. Unknown fields must be TBD.
`python tools/memory_manager.py init --workspace /tmp/kernel_opt_<name>`

Use `tools/memory_manager.py` for all memory JSON operations (create, read, update, mask/unmask, summary). See `python tools/memory_manager.py --help` for full usage.
If constraints change during the session, such as shape changes or relaxed thresholds, update README.md immediately. Do not leave changes only in the conversation.
Run the following phases in order. A phase must pass before the next phase starts.
Sub-skill: gpu-kernel-baseline
Goal: understand the PyTorch logic, extract compute pattern, input/output shapes, dtype, dependencies, and accuracy requirements; learn the target framework API and hardware constraints from <gpu-wiki>/README.md; then implement correct kernel.py and test_kernel.py, validate correctness and baseline performance, and create the starting point for profile-driven optimization.
The main agent must launch a subagent for Stage 1. The main agent must not implement the baseline directly. The subagent must read and follow gpu-kernel-baseline, read the PyTorch logic, learn framework APIs via <gpu-wiki>/README.md, implement kernel.py and test_kernel.py, validate correctness and performance, write baseline_report.md, write memory/v0.json, and commit.
Subagent requirements:
- Task type: editing task.
- Required inputs: workspace path, `README.md`, `memory/` directory, PyTorch logic or `kernel_demo`, platform, framework, dtype, shapes, and correctness threshold.
- Must do: read `gpu-kernel-baseline`; read workspace `README.md`; implement `kernel.py` and `test_kernel.py` based on CuteDSL or FlyDSL; run correctness and baseline performance validation; write `baseline_report.md`; write `memory/v0.json`; commit with git.
- Forbidden: do not skip `<gpu-wiki>/README.md`; do not fabricate hardware specs; do not modify Stage 2 plans or profiles; do not commit if correctness fails.
- Return: paths for `kernel.py` (implemented in CuteDSL or FlyDSL), `test_kernel.py`, and `baseline_report.md`; maximum `rel_err`; baseline performance; git commit hash; unresolved issues.
Entry criteria:
- `README.md` exists and includes Step 0 hardware specs, Roofline analysis, and `Stop Conditions`.
- Platform, framework, dtype, shapes, and correctness threshold are clear.
- PyTorch logic, `kernel_demo`, or reference code path is clear.
Exit criteria:
- Stage 1 subagent returned results.
- `kernel.py` exists and must be implemented based on CuteDSL or FlyDSL.
- `test_kernel.py` passes and records max `rel_err` plus PASS/FAIL.
- Baseline performance is recorded in `memory/v0.json`.
- `baseline_report.md` exists.
- Git commit is complete.
Then the main agent takes over and must enter Stage 2 immediately. It is forbidden to stop, summarize final deliverables, or exit the workflow after Stage 1 unless Stage 2 has also completed or the user explicitly asks to stop.
Sub-skill: gpu-kernel-profile-optimizer
Helper skill: gpu-kernel-bottleneck-analysis
Goal: use Step 0 Roofline conclusions and multiple profile -> code change -> validation loops to approach the performance limit.
Stage 2 researches, plans, searches gpu-wiki/reference projects, writes an optimization plan, profiles, modifies code, validates correctness, applies quality gates, commits, and writes memory/v<N>.json. It must continually compare against ISA optimization targets recorded in README.md.
Entry criteria: Stage 1 passed and README.md contains Step 0 Roofline analysis and Stop Conditions.
Exit condition: performance reaches the absolute target in README.md under Stop Conditions.
When the exit condition is met, stop optimization and summarize deliverables.
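The exit check can be sketched as a comparison between measured throughput and the target stored under `Stop Conditions`. This is an illustration under stated assumptions: latency is taken to be in milliseconds (the unit `triton.testing.do_bench` reports), FLOPs and bytes come from the Step 0 static analysis, and the function names are hypothetical.

```python
def achieved_tflops(flops: float, latency_ms: float) -> float:
    """Convert a measured latency into achieved TFLOPS."""
    return flops / (latency_ms * 1e-3) / 1e12


def achieved_tb_per_s(bytes_moved: float, latency_ms: float) -> float:
    """Convert a measured latency into achieved TB/s."""
    return bytes_moved / (latency_ms * 1e-3) / 1e12


def target_met(bound: str, flops: float, bytes_moved: float,
               latency_ms: float, target: float) -> bool:
    """Compare achieved throughput with the absolute target for the bound type."""
    if bound == "compute-bound":
        return achieved_tflops(flops, latency_ms) >= target
    if bound == "memory-bound":
        return achieved_tb_per_s(bytes_moved, latency_ms) >= target
    raise ValueError(f"unknown bound type: {bound}")
```

For instance, moving 4e9 bytes in 2 ms corresponds to about 2 TB/s, which would satisfy a placeholder memory-bound target of 1.8 TB/s. A passing check only settles whether to stop; identifying bottlenecks still requires the profilers named in the global constraints.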
- PyTorch logic only: Step 0 -> Stage 1 -> Stage 2
- Existing kernel with "why is it slow": Step 0 -> Stage 2
- Roofline analysis only: Step 0 only
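The routing cases above amount to a lookup from request type to phase sequence. A minimal sketch, with the hypothetical names `Request`, `ROUTES`, and `phases_for`, and assuming the request has already been classified:

```python
from enum import Enum


class Request(Enum):
    PYTORCH_LOGIC_ONLY = "pytorch_logic_only"
    EXISTING_KERNEL_SLOW = "existing_kernel_slow"
    ROOFLINE_ONLY = "roofline_only"


# Phase sequences copied from the routing list.
ROUTES = {
    Request.PYTORCH_LOGIC_ONLY: ["Step 0", "Stage 1", "Stage 2"],
    Request.EXISTING_KERNEL_SLOW: ["Step 0", "Stage 2"],
    Request.ROOFLINE_ONLY: ["Step 0"],
}


def phases_for(request: Request) -> list:
    """Return the ordered phases to run for a classified request."""
    return list(ROUTES[request])
```

Every route starts with Step 0, which matches the requirement that hardware specs and targets are sourced before any other phase runs.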
All sub-skills share top-level tools/:
- `tools/compute_utilization.py`
- `tools/bench_bandwidth.py`
- `tools/measure_bandwidth_ceiling.py`
- `tools/measure_kernel_time.py`
- `tools/extract_asm.py`
- `tools/profile_kernel.sh`
- `tools/memory_manager.py`
- `reference/workspace_init.sh`: workspace initialization script used in Startup phase.
- `reference/README.md`: workspace `README.md` template.
- `reference/plan.md`: optimization plan template.
- `reference/v_iteration.schema.json`: iteration JSON schema.
- `reference/profile_guide.md`: consolidated profile tool usage guide (ncu for NVIDIA, rocprofv3/ATT/PMC for AMD), sourced from gpu-wiki. Covers commands, key metrics, evidence extraction, SASS/ASM analysis, and troubleshooting.