
Conversation

janhohenheim (Contributor)

Pull Request Template

Checklist

  • Confirmed that cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Related Issues/PRs

tracel-ai/cubecl#850

Changes

Use wgpu 26 :)

Testing

CI and cargo run-checks
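
For context, the kind of dependency bump this PR makes looks roughly like the following (a hypothetical Cargo.toml excerpt; the actual change spans the workspace manifests and any code adjustments required by the wgpu 26 API):

```
[workspace.dependencies]
# Bumped from the previous major release to wgpu 26.
wgpu = "26.0"
```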

@laggui laggui added the ci:test-gpu When applied to a Pull Request execute the `test-gpu.yml` workflow. label Sep 2, 2025
codecov bot commented Sep 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.01%. Comparing base (0427901) to head (487f6bf).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3657      +/-   ##
==========================================
- Coverage   63.02%   63.01%   -0.01%     
==========================================
  Files        1042     1042              
  Lines      120602   120602              
==========================================
- Hits        76008    75998      -10     
- Misses      44594    44604      +10     


laggui (Member) commented Sep 3, 2025

GPU CI test failures are unrelated to the wgpu 26 upgrade:

  thread 'tests::cube_fusion::autodiff::f32_ty::ad_transpose::tests::should_diff_swap_dims' panicked at /home/agent/.cargo/git/checkouts/cubecl-058c47895211d464/e770e3e/crates/cubecl-runtime/src/tune/local.rs:155:26:
  Should run when selected by autotune.: Unknown("RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))")

This is an autotune + fusion bug. I cannot reproduce it locally, because autotune does not select the tunable that fails: it picks either the fallback or the simple unit matmul.

{"key":{"key":{"matmul_key":{"definition":{"m":2,"n":2,"k":2,"lhs_pow2_factor":1,"rhs_pow2_factor":1,"elem_lhs":{"Float":"F32"},"elem_rhs":{"Float":"F32"},"elem_out":{"Float":"F32"},"matrix_layout_lhs":"Contiguous","matrix_layout_rhs":{"MildlyPermuted":{"transposed":true,"batch_swap":false}}},"analysis":{"scale_global":"Small","kind":"General"}},"num_out_buffers":2,"num_ops":4},"checksum":"b460b2b5faab200d678115506a9ba0b1"},"value":{"fastest_index":1,"results":[{"Ok":{"name":"cubecl_runtime::tune::function_tunable::FunctionTunable<burn_cubecl_fusion::matmul::tune::tune_fused<cubecl_wgpu::runtime::WgpuRuntime, u32, burn_cubecl_fusion::matmul::optimization::SimpleUnit>, fn(burn_cubecl_fusion::tune::TuneInput<cubecl_wgpu::runtime::WgpuRuntime, burn_cubecl_fusion::matmul::optimization::MatmulOptimizationTuneArg<cubecl_wgpu::runtime::WgpuRuntime>>) -> core::result::Result<burn_cubecl_fusion::shared::trace::base::TuneOutput<cubecl_wgpu::runtime::WgpuRuntime>, alloc::string::String>>","index":1,"computation":{"mean":{"secs":0,"nanos":1616039},"median":{"secs":0,"nanos":1622980},"variance":{"secs":0,"nanos":191},"min":{"secs":0,"nanos":853141},"max":{"secs":0,"nanos":2373904}}}},{"Ok":{"name":"cubecl_runtime::tune::function_tunable::FunctionTunable<burn_cubecl_fusion::matmul::tune::tune_fallback<cubecl_wgpu::runtime::WgpuRuntime, u32>, fn(burn_cubecl_fusion::tune::TuneInput<cubecl_wgpu::runtime::WgpuRuntime, burn_cubecl_fusion::matmul::optimization::MatmulOptimizationTuneArg<cubecl_wgpu::runtime::WgpuRuntime>>) -> core::result::Result<burn_cubecl_fusion::shared::trace::base::TuneOutput<cubecl_wgpu::runtime::WgpuRuntime>, alloc::string::String>>","index":0,"computation":{"mean":{"secs":0,"nanos":2917332},"median":{"secs":0,"nanos":2896749},"variance":{"secs":0,"nanos":235},"min":{"secs":0,"nanos":1983155},"max":{"secs":0,"nanos":3886805}}}},{"Ok":{"name":"cubecl_runtime::tune::function_tunable::FunctionTunable<burn_cubecl_fusion::matmul::tune::tune_fused<cubecl_wgpu::runtime::WgpuRuntime, u32, burn_cubecl_fusion::matmul::optimization::DoubleUnit>, fn(burn_cubecl_fusion::tune::TuneInput<cubecl_wgpu::runtime::WgpuRuntime, burn_cubecl_fusion::matmul::optimization::MatmulOptimizationTuneArg<cubecl_wgpu::runtime::WgpuRuntime>>) -> core::result::Result<burn_cubecl_fusion::shared::trace::base::TuneOutput<cubecl_wgpu::runtime::WgpuRuntime>, alloc::string::String>>","index":4,"computation":{"mean":{"secs":0,"nanos":2499707},"median":{"secs":0,"nanos":3141134},"variance":{"secs":0,"nanos":3063},"min":{"secs":0,"nanos":713040},"max":{"secs":0,"nanos":5950330}}}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))"}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))"}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because a required feature is unavailable: Cmma on lhs Scalar(Float(F32)) rhs Scalar(Float(F32)) and output Scalar(Float(F32)) with shape m=16, n=16, k=8 not supported.\n\n))"}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because a required feature is unavailable: Cmma on lhs Scalar(Float(F32)) rhs Scalar(Float(F32)) and output Scalar(Float(F32)) with shape m=16, n=16, k=8 not supported.\n\n))"}},{"Err":{"Unknown":"RunnerError(LaunchError(Unable to launch matmul because a required feature is unavailable: Cmma on lhs Scalar(Float(F32)) 
rhs Scalar(Float(F32)) and output Scalar(Float(F32)) with shape m=16, n=16, k=8 not supported.\n\n))"}},{"Err":"Skip"},{"Err":"Skip"}]}}

The simple and double vecmat tunables get the same error, but are not selected (as expected).

{
  "Err": {
    "Unknown": "RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))"
  }
},
{
  "Err": {
    "Unknown": "RunnerError(LaunchError(Unable to launch matmul because the config is invalid: \"Lhs and Rhs must have same line size, got lhs=1 and rhs=2\"\n))"
  }
},
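
For reference, a minimal reproduction sketch of the failing pattern (hypothetical: reconstructed from the test name `should_diff_swap_dims` and the matmul key above, assuming a recent burn release with the `wgpu` and `autodiff` features):

```
// Hypothetical sketch; names and API calls assume burn's public Tensor API.
use burn::backend::{Autodiff, Wgpu};
use burn::tensor::Tensor;

type B = Autodiff<Wgpu>;

fn main() {
    let device = Default::default();
    // Contiguous lhs, shape [2, 2] (matches m=2, k=2 in the key above).
    let lhs = Tensor::<B, 2>::from_floats([[1.0, 7.0], [2.0, 3.0]], &device).require_grad();
    // After swap_dims, rhs has layout MildlyPermuted { transposed: true }.
    let rhs = Tensor::<B, 2>::from_floats([[4.0, 7.0], [2.0, 3.0]], &device).require_grad();

    // Matmul with a transposed operand, then backward: the pattern behind
    // should_diff_swap_dims and the cached autotune entry shown above.
    let out = lhs.clone().matmul(rhs.clone().swap_dims(0, 1));
    let grads = out.backward();
    let _lhs_grad = lhs.grad(&grads);
    let _rhs_grad = rhs.grad(&grads);
}
```

Run on its own, this passes locally, which matches the observation that the failure only appears once the autotune cache has been warmed by other tests in the suite.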

laggui (Member) commented Sep 4, 2025

Shared on Discord, but adding here for context:

(hidden autotune cache log since it may not be the root)

```
{ "key": { "key": { "matmul_key": { "definition": { "m": 2, "n": 2, "k": 2, "lhs_pow2_factor": 1, "rhs_pow2_factor": 1, "elem_lhs": { "Float": "F32" }, "elem_rhs": { "Float": "F32" }, "elem_out": { "Float": "F32" }, "matrix_layout_lhs": { "MildlyPermuted": { "transposed": true, "batch_swap": false } }, "matrix_layout_rhs": "Contiguous" }, "analysis": { "scale_global": "Small", "kind": "General" } }, "num_out_buffers": 0, "num_ops": 4 }, "checksum": "b460b2b5faab200d678115506a9ba0b1" }, "value": { "fastest_index": 1, "results": [ { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 1, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 2, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 3, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 5, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 6, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 7, "computation": { "mean": { "secs": 0, "nanos": 0 }, "median": { "secs": 0, "nanos": 0 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 0 }, "max": { "secs": 0, "nanos": 0 } } } }, { "Ok": { "name": "cubecl_runtime::tune::function_tunable::FunctionTunable, fn(burn_cubecl_fusion::tune::TuneInput>) -> core::result::Result, alloc::string::String>>", "index": 0, "computation": { "mean": { "secs": 0, "nanos": 5316 }, "median": { "secs": 0, "nanos": 5320 }, "variance": { "secs": 0, "nanos": 0 }, "min": { "secs": 0, "nanos": 5080 }, "max": { "secs": 0, "nanos": 5760 } } } }, { "Err": "Skip" }, { "Err": "Skip" }, { "Err": "Skip" } ] } }
```

It seems the kernels don't actually return an error during autotune, yet they clearly didn't execute anything (all-zero timings except for the fallback). So one of the failing kernels was selected, and the actual error only surfaces at runtime.

/edit: hmm, it might not be only a timing issue according to the debug info I added to the CI in this run. Timings look OK, and some kernels select vecmat as the fastest index, but it fails at runtime. The same test doesn't fail in isolation because the vecmat matmul algo fails during autotune setup (expected). But when you run the whole test suite, another test with the same matmul configs selects that algo (because the line sizes are OK in that case). So when the problematic test executes, it re-uses the selected algo from the cache but breaks at runtime (unexpected).
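
To illustrate the failure mode described above, here is a toy sketch (not CubeCL's actual autotune code; all names and the cached index are illustrative) of how a cache key that omits line sizes can replay a tunable selection that is invalid for a later, otherwise-identical launch:

```
use std::collections::HashMap;

// Illustrative key: captures shape and layout, but *not* line sizes.
#[derive(Hash, PartialEq, Eq, Clone)]
struct Key {
    m: usize,
    n: usize,
    k: usize,
    rhs_transposed: bool,
}

// Stand-in for the vecmat launch, with its line-size validity check.
fn launch_vecmat(lhs_line_size: u8, rhs_line_size: u8) -> Result<(), String> {
    if lhs_line_size != rhs_line_size {
        return Err(format!(
            "Lhs and Rhs must have same line size, got lhs={lhs_line_size} and rhs={rhs_line_size}"
        ));
    }
    Ok(())
}

fn main() {
    let mut cache: HashMap<Key, usize> = HashMap::new();
    let key = Key { m: 2, n: 2, k: 2, rhs_transposed: true };

    // Test A: both operands have line size 2, so the vecmat tunable
    // succeeds during autotune and its index is cached under `key`.
    assert!(launch_vecmat(2, 2).is_ok());
    cache.insert(key.clone(), 3); // 3 = index of the vecmat tunable (illustrative)

    // Test B: same shapes and layout, hence the same key, but this fusion
    // has lhs line size 1 and rhs line size 2. The cached selection is
    // replayed without re-validation, and the launch fails at runtime.
    if cache.contains_key(&key) {
        let err = launch_vecmat(1, 2).unwrap_err();
        println!("replayed cached tunable failed: {err}");
    }
}
```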

laggui (Member) left a comment

Failing CI is unrelated to the wgpu 26 upgrade, so let's move forward with it

@laggui laggui merged commit 5029a2d into tracel-ai:main Sep 4, 2025
12 of 18 checks passed
Labels
ci:test-gpu When applied to a Pull Request execute the `test-gpu.yml` workflow.