openinfer-project · xiaguan · Jun 11, 2026 · Jun 9, 2026
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/docs/index.md b/docs/index.md
@@ -121,6 +121,7 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l
 | `subsystems/kernels/openinfer-kernels-boundary.md` | Architecture decision: openinfer should use reusable frontend/runtime/data-plane layers plus per-model engines; kernels become first-class assets through a ledger, simulator, and request tracing. |
 | `subsystems/kernels/kernel-op-reports.md` | Qwen3 kernel/report tooling is feature-gated: `qwen3_kernel_report` covers per-op kernel reports, and `qwen3_model_report` emits runtime-traced eager-DAG decode operator rollups with TensorSpec `KernelCall`s, latency stats, tables, and Graphviz DOT; measured FA2 `CTA_TILE_Q=64` prefill default in place. |
 | `subsystems/kernels/typed-forward-pipeline.md` | Reusable typed tensor pipeline macro in `openinfer-kernels` so model crates can express common `typed_ops` chains without model-specific wrapper macros. |
+| `subsystems/kernels/tvm-ffi-mvp.md` | Optional `tvm-ffi-triton-cubin` bridge in `openinfer-kernels` plus a packed TVM wrapper for the Qwen3.5 GDR solve Triton AOT CUBIN launcher. |
 
 ## playbooks
 

diff --git a/docs/subsystems/kernels/tvm-ffi-mvp.md b/docs/subsystems/kernels/tvm-ffi-mvp.md
@@ -0,0 +1,80 @@
+# TVM FFI Triton CUBIN Wrapper
+
+> **TL;DR:** `openinfer-kernels` has an optional `tvm-ffi-triton-cubin` bridge for the Qwen3.5 GDR solve Triton AOT CUBIN launcher, with unit coverage for wrapper registration and packed-ABI diagnostics.
+>
+> **Last touched:** 2026-06
+
+## Preparation
+
+- **Read**:
+  - `docs/index.md` - routed this task to the kernels subsystem.
+  - `docs/subsystems/kernels/openinfer-kernels-boundary.md` - confirmed DSL/kernel integration belongs at the kernels boundary rather than in model runtimes.
+  - `docs/subsystems/kernels/kernel-op-reports.md` - confirmed Triton/CuTe tooling is already feature-scoped in kernel infrastructure.
+  - `openinfer-kernels/tools/triton/README.md` - described the current Triton AOT CUBIN generation and validation path.
+  - `openinfer-kernels/build.rs` - showed generated Triton AOT C stubs and wrapper symbols.
+  - `openinfer-kernels/src/ffi/qwen35.rs` and `openinfer-kernels/src/ffi/shared.rs` - showed the existing C ABI launch symbols used by Rust model code.
+  - Local `tvm-ffi` crate source - confirmed typed callbacks only cover up to 8 arguments, so Triton launchers need packed TVM FFI wrappers.
+- **Relevant history**:
+  - GitHub issue `#191` proposed TVM FFI as the DSL interface direction.
+  - Draft PR `#202` kept TVM FFI optional/test-only; PR `#315` keeps the bridge optional behind `tvm-ffi-triton-cubin` while focusing it on Triton CUBIN launch wrappers.
+- **Plan**:
+  1. Add `tvm-ffi` as an optional dependency of `openinfer-kernels` behind `tvm-ffi-triton-cubin`.
+  2. Add a `triton_cubin` module that exposes a current Qwen3.5 Triton AOT CUBIN launcher as a packed TVM FFI function.
+  3. Keep existing C ABI and Rust call sites available; the TVM FFI layer is an additional DSL boundary, not a production scheduler/model migration.
+  4. Add a small example that registers the wrapper and prints the function contract.
+  5. Validate formatting and the strongest local build/test checks available.
+- **Risks / open questions**:
+  - The `tvm-ffi-triton-cubin` feature means `tvm-ffi-config` and `libtvm_ffi` are build prerequisites only for the optional bridge path.
+  - The wrapper depends on `qwen35-4b` because the wrapped Triton AOT symbol is only generated with that feature.
+  - The current wrapper accepts raw device pointer and stream handles as TVM integers or opaque pointers; a future DLPack/tensor-handle wrapper can sit on top once the DSL artifact contract is stable.
+
+## Execution Log
+
+### Step 1: Optional dependency and wrapper surface
+- Added optional `tvm-ffi = "0.1.0-alpha.0"` to `openinfer-kernels` behind `tvm-ffi-triton-cubin`.
+- Made `tvm-ffi-triton-cubin` imply `qwen35-4b`, since the current wrapper targets a Qwen3.5 Triton AOT CUBIN symbol.
+- Added `openinfer_kernels::triton_cubin`, which exposes metadata plus a packed TVM FFI callback for the generated Qwen3.5 GDR solve Triton AOT launcher.
+- Kept existing CUDA C ABI symbols and model call sites unchanged.
+
+### Step 2: Small example
+- Added `openinfer-kernels/examples/triton_cubin_tvm_ffi.rs` to register the TVM FFI global function and print the launch contract.
+
+### Step 3: Unit test coverage
+- Added wrapper unit tests for:
+  - known/unknown wrapper lookup;
+  - global TVM FFI registry round-trip;
+  - accepted raw handle encodings (`i64`, `u64`, and opaque pointer);
+  - accepted TVM `i64` scalar launch dimensions;
+  - missing-argument diagnostics before CUDA launch;
+  - handle and scalar type diagnostics before CUDA launch.
+- Kept tests on pre-launch validation paths so they do not require valid device memory or actually launch the Triton CUBIN.
+
+### Step 4: Review fixes
+- Addressed xiaguan's requested changes on PR `#315`:
+  - made `tvm-ffi` optional behind `tvm-ffi-triton-cubin` so normal `openinfer-kernels` builds do not require `tvm-ffi-config` / `libtvm_ffi`;
+  - replaced `expect_err(...)` in tests with explicit `Result` matching because `tvm_ffi::Any` does not implement `Debug`;
+  - updated the example and docs to require/pass the feature.
+- Addressed automated inline feedback by accepting TVM FFI packed integers as `i64` for pointer handles and scalar launch dimensions, with range checks before casting.
+
+### Step 5: Rebase onto main
+- Rebasing onto `origin/main` renamed the kernel crate from `pegainfer-kernels` to `openinfer-kernels` and added the `qwen35-4b` Triton feature gate.
+- Adapted the TVM bridge to the renamed crate, `openinfer_kernels` Rust import path, `openinfer.triton_cubin.*` TVM global prefix, and `OPENINFER_*` docs.
+- Rebase validation:
+  - `cargo fmt --all --check` passed.
+  - `cargo metadata --no-deps --format-version 1` passed.
+  - `cargo tree -p openinfer-kernels -e normal --no-default-features --depth 1` shows no `tvm-ffi`.
+  - `cargo tree -p openinfer-kernels -e normal --features tvm-ffi-triton-cubin --depth 1` shows `tvm-ffi` only with the bridge feature enabled.
+  - `cargo check --release -p openinfer-kernels` and `PATH=/home/ziyang/gpu_memory_profiling/.venv/bin:$PATH cargo test --release -p openinfer-kernels --features tvm-ffi-triton-cubin triton_cubin --lib -- --nocapture` both stop in the existing CUDA build before Rust checks run: FlashInfer `v0.6.12` headers require CUDA symbols not available from this local CUDA 12.8 toolchain (`cuda::fast_mod_div`, `cuda::maximum`, `cuda::minimum`).
+
+## Debrief
+
+- **Outcome**: Added optional TVM FFI dependency wiring plus a real Triton CUBIN wrapper MVP for the Qwen3.5 GDR solve launcher, with unit tests covering wrapper discovery, registry registration, packed handle conversion, and pre-launch diagnostics.
+- **Pitfalls encountered**:
+  - `apply_patch` and normal shell commands were blocked by the sandbox namespace failure, so edits were applied with scoped scripts/patches.
+  - TVM FFI is now a real build prerequisite only when `tvm-ffi-triton-cubin` is enabled; hosts using that feature need `tvm-ffi-config` on `PATH`.
+  - Local full kernel-crate validation is currently blocked by the pinned FlashInfer headers failing under the local CUDA 12.8 toolchain, not by the TVM FFI code.
+- **Lessons learned**:
+  - TVM FFI typed callbacks currently cover only up to 8 arguments, while Triton/CUDA launchers can exceed that, so the wrapper should use packed TVM FFI callbacks for launch surfaces.
+- **Follow-ups**:
+  - Add packed TVM FFI wrappers for the remaining generated Triton AOT launchers once the FlashInfer/CUDA toolchain gate is green.
+  - Consider a higher-level DLPack/tensor-handle wrapper above the raw pointer/stream packed ABI once the DSL artifact contract is stable.
diff --git a/openinfer-kernels/Cargo.toml b/openinfer-kernels/Cargo.toml
@@ -9,18 +9,24 @@ anyhow = { workspace = true }
 cudarc = { workspace = true }
 half = { workspace = true }
 serde = { workspace = true }
+tvm-ffi = { version = "0.1.0-alpha.0", optional = true }
 
 [build-dependencies]
 cc = { workspace = true }
 
 [features]
 default = []
+tvm-ffi-triton-cubin = ["dep:tvm-ffi", "qwen35-4b"]
 # Qwen3.5 Triton AOT kernels (GDR chunkwise prefill) — the only feature that
 # needs Python + Triton at build time.
 qwen35-4b = []
 deepseek-v4 = []
 deepseek-v4-cutedsl-diagnostic = ["deepseek-v4"]
 kimi-k2 = []
 
+[[example]]
+name = "triton_cubin_tvm_ffi"
+required-features = ["tvm-ffi-triton-cubin"]
+
 [lints]
 workspace = true
diff --git a/openinfer-kernels/examples/triton_cubin_tvm_ffi.rs b/openinfer-kernels/examples/triton_cubin_tvm_ffi.rs
@@ -0,0 +1,20 @@
+use openinfer_kernels::triton_cubin::{self, QWEN35_GDR_CHUNK_SOLVE};
+
+fn main() -> tvm_ffi::Result<()> {
+    triton_cubin::register_global_functions()?;
+
+    println!("registered Triton CUBIN TVM FFI functions:");
+    for spec in triton_cubin::TRITON_CUBIN_FUNCTIONS {
+        println!("  {} -> {}", spec.name, spec.ffi_symbol);
+    }
+
+    let solve = triton_cubin::get_global_or_register(QWEN35_GDR_CHUNK_SOLVE.name)?;
+    println!(
+        "{} is ready; call it with packed args: {}",
+        QWEN35_GDR_CHUNK_SOLVE.name,
+        QWEN35_GDR_CHUNK_SOLVE.arg_names.join(", ")
+    );
+
+    drop(solve);
+    Ok(())
+}
diff --git a/openinfer-kernels/src/lib.rs b/openinfer-kernels/src/lib.rs
@@ -7,4 +7,6 @@ pub mod gpu_buffers;
 pub mod ops;
 pub mod paged_kv;
 pub mod tensor;
+#[cfg(feature = "tvm-ffi-triton-cubin")]
+pub mod triton_cubin;
 pub mod typed_ops;