Merged
104 changes: 104 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions docs/index.md
@@ -121,6 +121,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
| `subsystems/kernels/openinfer-kernels-boundary.md` | Architecture decision: openinfer should use reusable frontend/runtime/data-plane layers plus per-model engines; kernels become first-class assets through a ledger, simulator, and request tracing. |
| `subsystems/kernels/kernel-op-reports.md` | Qwen3 kernel/report tooling is feature-gated: `qwen3_kernel_report` covers per-op kernel reports, and `qwen3_model_report` emits runtime-traced eager-DAG decode operator rollups with TensorSpec `KernelCall`s, latency stats, tables, and Graphviz DOT; measured FA2 `CTA_TILE_Q=64` prefill default in place. |
| `subsystems/kernels/typed-forward-pipeline.md` | Reusable typed tensor pipeline macro in `openinfer-kernels` so model crates can express common `typed_ops` chains without model-specific wrapper macros. |
| `subsystems/kernels/tvm-ffi-mvp.md` | Optional `tvm-ffi-triton-cubin` bridge in `openinfer-kernels` plus a packed TVM wrapper for the Qwen3.5 GDR solve Triton AOT CUBIN launcher. |

## playbooks

80 changes: 80 additions & 0 deletions docs/subsystems/kernels/tvm-ffi-mvp.md
@@ -0,0 +1,80 @@
# TVM FFI Triton CUBIN Wrapper

> **TL;DR:** `openinfer-kernels` has an optional `tvm-ffi-triton-cubin` bridge for the Qwen3.5 GDR solve Triton AOT CUBIN launcher, with unit coverage for wrapper registration and packed-ABI diagnostics.
>
> **Last touched:** 2026-06

## Preparation

- **Read**:
  - `docs/index.md` - routed this task to the kernels subsystem.
  - `docs/subsystems/kernels/openinfer-kernels-boundary.md` - confirmed DSL/kernel integration belongs at the kernels boundary rather than in model runtimes.
  - `docs/subsystems/kernels/kernel-op-reports.md` - confirmed Triton/CuTe tooling is already feature-scoped in kernel infrastructure.
  - `openinfer-kernels/tools/triton/README.md` - described the current Triton AOT CUBIN generation and validation path.
  - `openinfer-kernels/build.rs` - showed generated Triton AOT C stubs and wrapper symbols.
  - `openinfer-kernels/src/ffi/qwen35.rs` and `openinfer-kernels/src/ffi/shared.rs` - showed the existing C ABI launch symbols used by Rust model code.
  - Local `tvm-ffi` crate source - confirmed typed callbacks cover at most 8 arguments, so Triton launchers need packed TVM FFI wrappers.
- **Relevant history**:
  - GitHub issue `#191` proposed TVM FFI as the DSL interface direction.
  - Draft PR `#202` kept TVM FFI optional/test-only; PR `#315` keeps the bridge optional behind `tvm-ffi-triton-cubin` while focusing it on Triton CUBIN launch wrappers.
- **Plan**:
  1. Add `tvm-ffi` as an optional dependency of `openinfer-kernels` behind `tvm-ffi-triton-cubin`.
  2. Add a `triton_cubin` module that exposes a current Qwen3.5 Triton AOT CUBIN launcher as a packed TVM FFI function.
  3. Keep existing C ABI and Rust call sites available; the TVM FFI layer is an additional DSL boundary, not a production scheduler/model migration.
  4. Add a small example that registers the wrapper and prints the function contract.
  5. Validate formatting and the strongest local build/test checks available.
- **Risks / open questions**:
  - `tvm-ffi-config` and `libtvm_ffi` become build prerequisites, but only when the optional `tvm-ffi-triton-cubin` feature is enabled.
  - The wrapper depends on `qwen35-4b` because the wrapped Triton AOT symbol is only generated with that feature.
  - The current wrapper accepts raw device pointer and stream handles as TVM integers or opaque pointers; a future DLPack/tensor-handle wrapper can sit on top once the DSL artifact contract is stable.
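
The last risk item describes handles arriving either as integers or as opaque pointers. A minimal, standard-library-only sketch of that decoding step follows. `PackedArg` and `decode_handle` are hypothetical names introduced for illustration; the actual bridge would work with the `tvm-ffi` crate's own value type, whose API is not shown here.

```rust
use std::convert::TryFrom;
use std::ffi::c_void;

/// Hypothetical stand-in for one packed argument (not a `tvm-ffi` type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PackedArg {
    Int(i64),
    UInt(u64),
    OpaquePtr(*mut c_void),
    Float(f64),
}

/// Decode a raw device pointer or stream handle from any accepted encoding.
/// Everything here runs before a CUDA launch, so a bad argument turns into
/// an error value instead of an invalid pointer.
pub fn decode_handle(name: &str, arg: PackedArg) -> Result<usize, String> {
    match arg {
        // Signed integers are range-checked: negative values are rejected.
        PackedArg::Int(v) => usize::try_from(v)
            .map_err(|_| format!("`{name}`: handle {v} does not fit in a pointer")),
        // Unsigned integers can only fail on targets narrower than 64 bits.
        PackedArg::UInt(v) => usize::try_from(v)
            .map_err(|_| format!("`{name}`: handle {v} does not fit in a pointer")),
        // Opaque pointers are passed through as an address.
        PackedArg::OpaquePtr(p) => Ok(p as usize),
        other => Err(format!(
            "`{name}`: expected an integer or opaque pointer, got {other:?}"
        )),
    }
}
```

For example, `decode_handle("stream", PackedArg::Int(-1))` returns an error, while `PackedArg::Int(4096)` and `PackedArg::UInt(4096)` both decode to the address `4096`.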

## Execution Log

### Step 1: Optional dependency and wrapper surface
- Added optional `tvm-ffi = "0.1.0-alpha.0"` to `openinfer-kernels` behind `tvm-ffi-triton-cubin`.
- Made `tvm-ffi-triton-cubin` imply `qwen35-4b`, since the current wrapper targets a Qwen3.5 Triton AOT CUBIN symbol.
- Added `openinfer_kernels::triton_cubin`, which exposes metadata plus a packed TVM FFI callback for the generated Qwen3.5 GDR solve Triton AOT launcher.
- Kept existing CUDA C ABI symbols and model call sites unchanged.
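
One way to picture the "metadata" half of this step is a static table of function specs that can be searched by name. The sketch below is an assumption-heavy illustration: the field names `name`, `ffi_symbol`, and `arg_names` match what the example binary in this PR prints, but the struct name, the constant names, `find_spec`, and every string value are placeholders rather than the module's real contents.

```rust
/// Hypothetical metadata record for one wrapped launcher.
#[derive(Debug, PartialEq)]
pub struct TritonCubinFunctionSpec {
    /// Global function name used for registration (placeholder value below).
    pub name: &'static str,
    /// C ABI symbol the wrapper forwards to (placeholder value below).
    pub ffi_symbol: &'static str,
    /// Packed argument names, in call order.
    pub arg_names: &'static [&'static str],
}

/// Placeholder entry; the real launcher's name, symbol, and arguments differ.
pub const EXAMPLE_SOLVE: TritonCubinFunctionSpec = TritonCubinFunctionSpec {
    name: "example.triton_cubin.solve",
    ffi_symbol: "example_solve_cuda",
    arg_names: &["a_ptr", "out_ptr", "batch", "stream"],
};

/// All wrappers known to this sketch.
pub const FUNCTIONS: &[TritonCubinFunctionSpec] = &[EXAMPLE_SOLVE];

/// Look a wrapper up by its global name; unknown names yield `None`.
pub fn find_spec(name: &str) -> Option<&'static TritonCubinFunctionSpec> {
    FUNCTIONS.iter().find(|spec| spec.name == name)
}
```

Keeping the contract in plain `'static` data means it can be printed or checked without touching the TVM FFI runtime or the GPU.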

### Step 2: Small example
- Added `openinfer-kernels/examples/triton_cubin_tvm_ffi.rs` to register the TVM FFI global function and print the launch contract.

### Step 3: Unit test coverage
- Added wrapper unit tests for:
  - known/unknown wrapper lookup;
  - global TVM FFI registry round-trip;
  - accepted raw handle encodings (`i64`, `u64`, and opaque pointer);
  - accepted TVM `i64` scalar launch dimensions;
  - missing-argument diagnostics before CUDA launch;
  - handle and scalar type diagnostics before CUDA launch.
- Kept tests on pre-launch validation paths so they do not require valid device memory or actually launch the Triton CUBIN.
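
The pre-launch validation these tests exercise can be sketched roughly as below. This is not the PR's code: `PackedArg`, `ARG_NAMES`, `check_arity`, `decode_dim`, and `validate` are hypothetical, the four-argument contract is invented for illustration, and the choice of `u32` for a launch dimension is an assumption.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a packed argument value.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PackedArg {
    Int(i64),
    Float(f64),
}

/// Hypothetical launch contract: argument names in call order.
pub const ARG_NAMES: &[&str] = &["x_ptr", "out_ptr", "rows", "stream"];

/// Report the first missing argument by name, or an excess-argument error.
pub fn check_arity(args: &[PackedArg]) -> Result<(), String> {
    if args.len() < ARG_NAMES.len() {
        return Err(format!(
            "missing argument `{}`: expected {} arguments, got {}",
            ARG_NAMES[args.len()],
            ARG_NAMES.len(),
            args.len()
        ));
    }
    if args.len() > ARG_NAMES.len() {
        return Err(format!(
            "expected {} arguments, got {}",
            ARG_NAMES.len(),
            args.len()
        ));
    }
    Ok(())
}

/// Decode a scalar launch dimension, rejecting non-integers and values
/// (negative or too large) that do not fit the assumed `u32` width.
pub fn decode_dim(name: &str, arg: PackedArg) -> Result<u32, String> {
    match arg {
        PackedArg::Int(v) => u32::try_from(v)
            .map_err(|_| format!("`{name}`: {v} is out of range for a launch dimension")),
        other => Err(format!("`{name}`: expected an integer, got {other:?}")),
    }
}

/// Run every check that can fail without touching the GPU.
pub fn validate(args: &[PackedArg]) -> Result<u32, String> {
    check_arity(args)?;
    decode_dim("rows", args[2])
}
```

Because every failure path returns before any launch would happen, tests of this kind can pass dummy integers as pointers without owning device memory.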

### Step 4: Review fixes
- Addressed xiaguan's requested changes on PR `#315`:
  - made `tvm-ffi` optional behind `tvm-ffi-triton-cubin` so normal `openinfer-kernels` builds do not require `tvm-ffi-config` / `libtvm_ffi`;
  - replaced `expect_err(...)` in tests with explicit `Result` matching because `tvm_ffi::Any` does not implement `Debug`;
  - updated the example and docs to require/pass the feature.
- Addressed automated inline feedback by accepting TVM FFI packed integers as `i64` for pointer handles and scalar launch dimensions, with range checks before casting.
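
The `expect_err` change follows from a standard-library constraint: `Result::expect_err` is only available when the success type implements `Debug`, because it formats the `Ok` value in its panic message. The sketch below reproduces the situation with a hypothetical `Opaque` type standing in for a value without `Debug`; `call` and `error_of` are illustrative names, not functions from this PR.

```rust
/// Stand-in for a success value that does not implement `Debug`.
pub struct Opaque(pub i64);

/// Hypothetical fallible call returning a non-`Debug` success value.
pub fn call(succeed: bool) -> Result<Opaque, String> {
    if succeed {
        Ok(Opaque(1))
    } else {
        Err("missing argument `rows`".to_string())
    }
}

/// Extract the error by matching. `result.expect_err("...")` would not
/// compile here because `Opaque` lacks a `Debug` implementation.
pub fn error_of(result: Result<Opaque, String>) -> String {
    match result {
        Ok(_) => panic!("expected an error, but the call succeeded"),
        Err(err) => err,
    }
}
```

In a test this reads as `assert!(error_of(call(false)).contains("missing argument"))`, which keeps the diagnostic check without requiring a new trait implementation on the success type.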

### Step 5: Rebase onto main
- Rebasing onto `origin/main` picked up the rename of the kernel crate from `pegainfer-kernels` to `openinfer-kernels` and the new `qwen35-4b` Triton feature gate.
- Adapted the TVM bridge to the renamed crate, `openinfer_kernels` Rust import path, `openinfer.triton_cubin.*` TVM global prefix, and `OPENINFER_*` docs.
- Rebase validation:
  - `cargo fmt --all --check` passed.
  - `cargo metadata --no-deps --format-version 1` passed.
  - `cargo tree -p openinfer-kernels -e normal --no-default-features --depth 1` shows no `tvm-ffi`.
  - `cargo tree -p openinfer-kernels -e normal --features tvm-ffi-triton-cubin --depth 1` shows `tvm-ffi` only with the bridge feature enabled.
  - `cargo check --release -p openinfer-kernels` and `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo test --release -p openinfer-kernels --features tvm-ffi-triton-cubin triton_cubin --lib -- --nocapture` both stop in the existing CUDA build before Rust checks run: FlashInfer `v0.6.12` headers require CUDA symbols not available from this local CUDA 12.8 toolchain (`cuda::fast_mod_div`, `cuda::maximum`, `cuda::minimum`).

## Debrief

- **Outcome**: Added optional TVM FFI dependency wiring plus a real Triton CUBIN wrapper MVP for the Qwen3.5 GDR solve launcher, with unit tests covering wrapper discovery, registry registration, packed handle conversion, and pre-launch diagnostics.
- **Pitfalls encountered**:
  - `apply_patch` and normal shell commands were blocked by the sandbox namespace failure, so edits were applied with scoped scripts/patches.
  - TVM FFI is now a real build prerequisite only when `tvm-ffi-triton-cubin` is enabled; hosts using that feature need `tvm-ffi-config` on `PATH`.
  - Local full kernel-crate validation is currently blocked by the pinned FlashInfer headers failing under the local CUDA 12.8 toolchain, not by the TVM FFI code.
- **Lessons learned**:
  - TVM FFI typed callbacks currently cover only up to 8 arguments, while Triton/CUDA launchers can exceed that, so the wrapper should use packed TVM FFI callbacks for launch surfaces.
- **Follow-ups**:
  - Add packed TVM FFI wrappers for the remaining generated Triton AOT launchers once the FlashInfer/CUDA toolchain gate is green.
  - Consider a higher-level DLPack/tensor-handle wrapper above the raw pointer/stream packed ABI once the DSL artifact contract is stable.
6 changes: 6 additions & 0 deletions openinfer-kernels/Cargo.toml
@@ -9,18 +9,24 @@ anyhow = { workspace = true }
cudarc = { workspace = true }
half = { workspace = true }
serde = { workspace = true }
tvm-ffi = { version = "0.1.0-alpha.0", optional = true }

[build-dependencies]
cc = { workspace = true }

[features]
default = []
tvm-ffi-triton-cubin = ["dep:tvm-ffi", "qwen35-4b"]
# Qwen3.5 Triton AOT kernels (GDR chunkwise prefill) — the only feature that
# needs Python + Triton at build time.
qwen35-4b = []
deepseek-v4 = []
deepseek-v4-cutedsl-diagnostic = ["deepseek-v4"]
kimi-k2 = []

[[example]]
name = "triton_cubin_tvm_ffi"
required-features = ["tvm-ffi-triton-cubin"]

[lints]
workspace = true
20 changes: 20 additions & 0 deletions openinfer-kernels/examples/triton_cubin_tvm_ffi.rs
@@ -0,0 +1,20 @@
use openinfer_kernels::triton_cubin::{self, QWEN35_GDR_CHUNK_SOLVE};

fn main() -> tvm_ffi::Result<()> {
    triton_cubin::register_global_functions()?;

    println!("registered Triton CUBIN TVM FFI functions:");
    for spec in triton_cubin::TRITON_CUBIN_FUNCTIONS {
        println!(" {} -> {}", spec.name, spec.ffi_symbol);
    }

    let solve = triton_cubin::get_global_or_register(QWEN35_GDR_CHUNK_SOLVE.name)?;
    println!(
        "{} is ready; call it with packed args: {}",
        QWEN35_GDR_CHUNK_SOLVE.name,
        QWEN35_GDR_CHUNK_SOLVE.arg_names.join(", ")
    );

    drop(solve);
    Ok(())
}
2 changes: 2 additions & 0 deletions openinfer-kernels/src/lib.rs
@@ -7,4 +7,6 @@ pub mod gpu_buffers;
pub mod ops;
pub mod paged_kv;
pub mod tensor;
#[cfg(feature = "tvm-ffi-triton-cubin")]
pub mod triton_cubin;
pub mod typed_ops;