
Conversation

janhohenheim (Contributor)

Pull Request Template

Checklist

  • Confirmed that cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Related Issues/PRs

tracel-ai/cubecl#850

Changes

Use wgpu 26 :)

Testing

CI and cargo run-checks
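
For context, the kind of dependency bump this PR makes looks roughly like the following (a hypothetical Cargo.toml excerpt; the actual change spans the workspace manifests and any code adjustments required by the wgpu 26 API):

```
[workspace.dependencies]
# Bumped from the previous major release to wgpu 26.
wgpu = "26.0"
```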

@laggui laggui added the ci:test-gpu When applied to a Pull Request execute the `test-gpu.yml` workflow. label Sep 2, 2025
codecov bot commented Sep 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.01%. Comparing base (0427901) to head (487f6bf).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3657      +/-   ##
==========================================
- Coverage   63.02%   63.01%   -0.01%     
==========================================
  Files        1042     1042              
  Lines      120602   120602              
==========================================
- Hits        76008    75998      -10     
- Misses      44594    44604      +10     


laggui (Member) commented Sep 3, 2025

GPU CI test failures are unrelated to the wgpu 26 upgrade:

  thread 'tests::cube_fusion::autodiff::f32_ty::ad_transpose::tests::should_diff_swap_dims' panicked at /home/agent/.cargo/git/checkouts/cubecl-058c47895211d464/e770e3e/crates/cubecl-runtime/src/tune/local.rs:155:26:
  Should run when selected by autotune.: Unknown("RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))")

This is an autotune + fusion bug. I cannot reproduce it locally, because autotune does not select the tunable that fails: it picks either the fallback or the simple unit matmul.

{"key":{"key":{"matmul_key":{"definition":{"m":2,"n":2,"k":2,"lhs_pow2_factor":1,"rhs_pow2_factor":1,"elem_lhs":{"Float":"F32"},"elem_rhs":{"Float":"F32"},"elem_out":{"Float":"F32"},"matrix_layout_lhs":"Contiguous","matrix_layout_rhs":{"MildlyPermuted":{"transposed":true,"batch_swap":false}}},"analysis":{"scale_global":"Small","kind":"General"}},"num_out_buffers":2,"num_ops":4},"checksum":"b460b2b5faab200d678115506a9ba0b1"},"value":{"fastest_index":1,"results":[{"Ok":{"name":"cubecl_runtime::tune::function_tunable::FunctionTunable<burn_cubecl_fusion::matmul::tune::tune_fused<cubecl_wgpu::runtime::WgpuRuntime, u32, burn_cubecl_fusion::matmul::optimization::SimpleUnit>, fn(burn_cubecl_fusion::tune::TuneInput<cubecl_wgpu::runtime::WgpuRuntime, burn_cubecl_fusion::matmul::optimization::MatmulOptimizationTuneArg<cubecl_wgpu::runtime::WgpuRuntime>>) -> core::result::Result<burn_cubecl_fusion::shared::trace::base::TuneOutput<cubecl_wgpu::runtime::WgpuRuntime>, alloc::string::String>>","index":1,"computation":{"mean":{"secs":0,"nanos":1616039},"median":{"secs":0,"nanos":1622980},"variance":{"secs":0,"nanos":191},"min":{"secs":0,"nanos":853141},"max":{"secs":0,"nanos":2373904}}}},{"Ok":{"name":"cubecl_runtime::tune::function_tunable::FunctionTunable<burn_cubecl_fusion::matmul::tune::tune_fallback<cubecl_wgpu::runtime::WgpuRuntime, u32>, fn(burn_cubecl_fusion::tune::TuneInput<cubecl_wgpu::runtime::WgpuRuntime, burn_cubecl_fusion::matmul::optimization::MatmulOptimizationTuneArg<cubecl_wgpu::runtime::WgpuRuntime>>) -> core::result::Result<burn_cubecl_fusion::shared::trace::base::TuneOutput<cubecl_wgpu::runtime::WgpuRuntime>, alloc::string::String>>","index":0,"computation":{"mean":{"secs":0,"nanos":2917332},"median":{"secs":0,"nanos":2896749},"variance":{"secs":0,"nanos":235},"min":{"secs":0,"nanos":1983155},"max":{"secs":0,"nanos":3886805}}}},{"Ok":{"name":"cubecl_runtime::tune::function_tunable::FunctionTunable<burn_cubecl_fusion::matmul::tune::tune_fused<cubecl_wgpu::runtime::WgpuRuntime, u32, burn_cubecl_fusion::matmul::optimization::DoubleUnit>, fn(burn_cubecl_fusion::tune::TuneInput<cubecl_wgpu::runtime::WgpuRuntime, burn_cubecl_fusion::matmul::optimization::MatmulOptimizationTuneArg<cubecl_wgpu::runtime::WgpuRuntime>>) -> core::result::Result<burn_cubecl_fusion::shared::trace::base::TuneOutput<cubecl_wgpu::runtime::WgpuRuntime>, alloc::string::String>>","index":4,"computation":{"mean":{"secs":0,"nanos":2499707},"median":{"secs":0,"nanos":3141134},"variance":{"secs":0,"nanos":3063},"min":{"secs":0,"nanos":713040},"max":{"secs":0,"nanos":5950330}}}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))"}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))"}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because a required feature is unavailable: Cmma on lhs Scalar(Float(F32)) rhs Scalar(Float(F32)) and output Scalar(Float(F32)) with shape m=16, n=16, k=8 not supported.\n\n))"}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because a required feature is unavailable: Cmma on lhs Scalar(Float(F32)) rhs Scalar(Float(F32)) and output Scalar(Float(F32)) with shape m=16, n=16, k=8 not supported.\n\n))"}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because a required feature is unavailable: Cmma on lhs Scalar(Float(F32)) 
rhs Scalar(Float(F32)) and output Scalar(Float(F32)) with shape m=16, n=16, k=8 not supported.\n\n))"}},{"Err":"Skip"},{"Err":"Skip"}]}}

The simple and double vecmat tunables get the same error, but are not selected (as expected).

{
  "Err": {
    "Unknown": "RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))"
  }
},
{
  "Err": {
    "Unknown": "RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))"
  }
},
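
For reference, a minimal reproduction sketch of the failing pattern (hypothetical: reconstructed from the test name `should_diff_swap_dims` and the matmul key above, assuming a recent burn release with the `wgpu` and `autodiff` features):

```
// Hypothetical sketch; names and API calls assume burn's public Tensor API.
use burn::backend::{Autodiff, Wgpu};
use burn::tensor::Tensor;

type B = Autodiff<Wgpu>;

fn main() {
    let device = Default::default();
    // Contiguous lhs, shape [2, 2] (matches m=2, k=2 in the key above).
    let lhs = Tensor::<B, 2>::from_floats([[1.0, 7.0], [2.0, 3.0]], &device).require_grad();
    // After swap_dims, rhs has layout MildlyPermuted { transposed: true }.
    let rhs = Tensor::<B, 2>::from_floats([[4.0, 7.0], [2.0, 3.0]], &device).require_grad();

    // Matmul with a transposed operand, then backward: the pattern behind
    // should_diff_swap_dims and the cached autotune entry shown above.
    let out = lhs.clone().matmul(rhs.clone().swap_dims(0, 1));
    let grads = out.backward();
    let _lhs_grad = lhs.grad(&grads);
    let _rhs_grad = rhs.grad(&grads);
}
```

Run on its own, this passes locally, which matches the observation that the failure only appears once the autotune cache has been warmed by other tests in the suite.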

laggui (Member) commented Sep 4, 2025

Shared on Discord, but adding here for context:

(hidden autotune cache log since it may not be the root)

```
{ "key": { "key": { "matmul_key": { "definition": { "m": 2, "n": 2, "k": 2, "lhs_pow2_factor": 1, "rhs_pow2_factor": 1, "elem_lhs": { "Float": "F32" }, "elem_rhs": { "Float": "F32" }, "elem_out": { "Float": "F32" }, "matrix_layout_lhs": { "MildlyPermuted": { "transposed": true, "batch_swap": false } }, "matrix_layout_rhs": "Contiguous" }, "analysis": { "scale_global": "Small", "kind": "General" } }, "num_out_buffers": 0, "num_ops": 4 }, "checksum": "b460b2b5faab200d678115506a9ba0b1" }, "value": { "fastest_index": 1, "results": [ { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 1, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 2, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 3, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 5, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 6, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 7, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 0, "computation": { "mean": { "secs": 0, "nanos": 5316 }, "median": { "secs": 0, "nanos": 5320 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 5080 }, "max": { "secs": 0, "nanos": 5760 } } } }, { "Err": "Skip" }, { "Err": "Skip" }, { "Err": "Skip" } ] } }
```

It seems the kernels don't actually return an error during autotune, yet they clearly didn't execute anything (all-zero timings except for the fallback). So one of the failing kernels was selected, and the actual error only surfaces at runtime.

/edit: hmm, it might not be only a timing issue according to the debug info I added to the CI in this run. Timings look OK, and some kernels select vecmat as the fastest index, but it fails at runtime. The same test doesn't fail in isolation because the vecmat matmul algo fails during autotune setup (expected). But when you run the whole test suite, another test with the same matmul configs selects that algo (because the line sizes are OK in that case). So when the problematic test executes, it re-uses the selected algo from the cache but breaks at runtime (unexpected).
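
To illustrate the failure mode described above, here is a toy sketch (not CubeCL's actual autotune code; all names and the cached index are illustrative) of how a cache key that omits line sizes can replay a tunable selection that is invalid for a later, otherwise-identical launch:

```
use std::collections::HashMap;

// Illustrative key: captures shape and layout, but *not* line sizes.
#[derive(Hash, PartialEq, Eq, Clone)]
struct Key {
    m: usize,
    n: usize,
    k: usize,
    rhs_transposed: bool,
}

// Stand-in for the vecmat launch, with its line-size validity check.
fn launch_vecmat(lhs_line_size: u8, rhs_line_size: u8) -> Result<(), String> {
    if lhs_line_size != rhs_line_size {
        return Err(format!(
            "Lhs and Rhs must have same line size, got lhs={lhs_line_size} and rhs={rhs_line_size}"
        ));
    }
    Ok(())
}

fn main() {
    let mut cache: HashMap<Key, usize> = HashMap::new();
    let key = Key { m: 2, n: 2, k: 2, rhs_transposed: true };

    // Test A: both operands have line size 2, so the vecmat tunable
    // succeeds during autotune and its index is cached under `key`.
    assert!(launch_vecmat(2, 2).is_ok());
    cache.insert(key.clone(), 3); // 3 = index of the vecmat tunable (illustrative)

    // Test B: same shapes and layout, hence the same key, but this fusion
    // has lhs line size 1 and rhs line size 2. The cached selection is
    // replayed without re-validation, and the launch fails at runtime.
    if cache.contains_key(&key) {
        let err = launch_vecmat(1, 2).unwrap_err();
        println!("replayed cached tunable failed: {err}");
    }
}
```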

laggui (Member) left a comment

Failing CI is unrelated to the wgpu 26 upgrade, so let's move forward with it

@laggui laggui merged commit 5029a2d into tracel-ai:main Sep 4, 2025
12 of 18 checks passed
Labels
ci:test-gpu When applied to a Pull Request execute the `test-gpu.yml` workflow.