Using global constant arrays produces slow Metal code #4367

jimblandy · 2021-10-14T22:08:36Z

We generate Metal Shading Language code that runs slowly on some Apple devices, for shaders that index a large global constant array. The generated code copies a global constant into a function-local array every time the function is called; if the constant is large, this is expensive.

For the following input:

var<private> constant_array: array<f32, 4> = array<f32, 4>(0.0, 0.707, 1.0, 0.707);

[[stage(fragment)]]
fn access_array([[location(0)]] i: i32) -> [[location(0)]] f32 {
   return constant_array[i];
}

We generate the following MSL:

...

struct type1 {
    float inner[4];
};
constant type1 const_type1_ = {0.0, 0.7070000171661377, 1.0, 0.7070000171661377};

...
fragment ... access_array(
  ...
) {
    type1 constant_array = const_type1_;
    const auto i = ...;
    float _e3 = constant_array.inner[i];
    ...
}

This declares a global constant const_type1_, which access_array copies into the local variable constant_array each time it's called. For this example it's probably not a big deal, but people use big data tables, so the copy can be expensive.

Naga introduces this copy because MSL doesn't support mutable global arrays. Metal Shading Language Specification 2.4 §4, "Address Spaces", says, "The address space for a variable at program scope must be constant." Further, §4.2 says, "Variables in program scope must be declared in the constant address space and initialized during the declaration statement."

WGSL suggests that people use let for global constants, but the author can't use let here, since WGSL can't index arrays unless they're in memory.

What Naga should do when generating Metal (and what other toolchains seem to do now) is recognize that constant_array is not actually ever assigned to, generate a global variable in the Metal constant address space, and refer to that directly from function bodies.

The text was updated successfully, but these errors were encountered:

jimblandy · 2021-10-15T00:02:27Z

Here's what Google's Tint does with the example above:

struct tint_array_wrapper {
  float arr[4];
};
...
float access_array_inner(int i, thread tint_array_wrapper* const tint_symbol_3) {
  return (*(tint_symbol_3)).arr[i];
}

fragment tint_symbol_2 access_array(tint_symbol_1 tint_symbol [[stage_in]]) {
  thread tint_array_wrapper tint_symbol_4 = {.arr={0.0f, 0.707000017f, 1.0f, 0.707000017f}};
  float const inner_result = access_array_inner(tint_symbol.i, &(tint_symbol_4));
  tint_symbol_2 wrapper_result = {};
  wrapper_result.value = inner_result;
  return wrapper_result;
}

The semantics expressed are certainly run-time initialization, like Naga. But of course, since the issue here is inconsistent optimization (in some cases, Naga's output was fine), I don't think we can really say whether this is better or worse.

It seems safe to assume, though, that if we generate a global constant initialized at compile time, it's unlikely that any code generator along the path to the GPU would introduce run-time copies.

kvark · 2021-10-15T12:43:56Z

Thank you for writing this down!

type1 constant_array = const_type1_;

I'm not actually sure if that's a problem. I thought the problem was with code we inject to copy an array element-by-element, but this here is a bulk copy. Obviously can be improved, but I'd like us to not lose sight of what the actual issue is. A test case, maybe within https://github.com/kvark/wgpu-bench, would help.

jimblandy · 2021-10-16T19:36:30Z

Yes, a test case would be good. I was trying to capture what was described as the problem in matrix chat.

benjamin-kramer · 2023-08-07T14:07:32Z

On a shader I'm using, this issue causes a 3x slowdown (from ~700us to 2ms).
The constant array I'm using is 64-floats-long.
I'm not sure exactly what causes the slowdown, but if one can believe Xcode's GPU trace, this is both the definition in the fragment shader (30%), and the actual access to this array (6%).

teoxoy · 2023-08-07T14:36:41Z

Are you on the latest version (0.13)? We got rid of the loop initialization in the latest release (in gfx-rs/naga#2331) which was thought to be the cause of the slowdown.

benjamin-kramer · 2023-08-07T15:03:43Z

Yes. The relevant generated code is:

struct type_6 {
    float inner[64];
};
float foo(
    ...
    thread type_6& kNormalRandomNumbers
) {
    ...
    float _e9 = kNormalRandomNumbers.inner[__some_dynamic_index__];
    ...
}
fragment mainFragmentOutput mainFragment(
    ...
) {
    ...
    type_6 kNormalRandomNumbers = type_6 {0.18074971, ..., 0.64232767};
    ...
    ... = foo(..., kNormalRandomNumbers);
    ...
}

If instead the array is defined as:

static constant float kNormalRandomNumbers[] = {
  0.18074972, ...,  0.64232764
};

And is accessed directly by foo (instead of receiving the array as a parameter), the running time of the shader function is improved from 2ms to 700us.

teoxoy · 2023-08-07T19:08:17Z

Would defining the array as a global const in your wgsl source resolve the issue?

A var<private> (assuming that's what you are using in your source wgsl for the array) is mutable memory and I'd imagine (if the metal compiler is not doing any other optimizations) that the array will keep being copied for each fragment.

benjamin-kramer · 2023-08-08T07:12:39Z

It would, if I wouldn't need to access it with a dynamic index.
As it is, it doesn't seem possible to use a const array here.

teoxoy · 2023-08-08T11:56:44Z

Ah, forgot we don't support that yet. See #4337.

jimblandy self-assigned this Oct 14, 2021

jimblandy added area: naga back-end Outputs of naga shader conversion lang: Metal Metal Shading Language labels Oct 14, 2021

jimblandy mentioned this issue Oct 16, 2021

WGSL requires dynamic matrix indexing, which requires surprising SPIR-V gpuweb/gpuweb#1782

Closed

cwfitzgerald added the naga Shader Translator label Oct 25, 2023

cwfitzgerald transferred this issue from gfx-rs/naga Oct 25, 2023

teoxoy added this to the WebGPU Specification V1 milestone Nov 3, 2023

teoxoy added this to WebGPU for Firefox Jan 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using global constant arrays produces slow Metal code #4367

Using global constant arrays produces slow Metal code #4367

jimblandy commented Oct 14, 2021 •

edited

Loading

jimblandy commented Oct 15, 2021

kvark commented Oct 15, 2021

jimblandy commented Oct 16, 2021

benjamin-kramer commented Aug 7, 2023 •

edited

Loading

teoxoy commented Aug 7, 2023

benjamin-kramer commented Aug 7, 2023 •

edited

Loading

teoxoy commented Aug 7, 2023

benjamin-kramer commented Aug 8, 2023

teoxoy commented Aug 8, 2023

Using global constant arrays produces slow Metal code #4367

Using global constant arrays produces slow Metal code #4367

Comments

jimblandy commented Oct 14, 2021 • edited Loading

jimblandy commented Oct 15, 2021

kvark commented Oct 15, 2021

jimblandy commented Oct 16, 2021

benjamin-kramer commented Aug 7, 2023 • edited Loading

teoxoy commented Aug 7, 2023

benjamin-kramer commented Aug 7, 2023 • edited Loading

teoxoy commented Aug 7, 2023

benjamin-kramer commented Aug 8, 2023

teoxoy commented Aug 8, 2023

jimblandy commented Oct 14, 2021 •

edited

Loading

benjamin-kramer commented Aug 7, 2023 •

edited

Loading

benjamin-kramer commented Aug 7, 2023 •

edited

Loading