Robust memory allocation handling #606

Open · DJMcNab wants to merge 18 commits into main
Conversation

@DJMcNab (Member) commented Jun 7, 2024

Fixes #366

This is work in progress:

  • Cancel the pipeline if a failure would otherwise be likely to occur
  • Read back the bump buffer (from two frames ago) before starting to render a frame
  • Reallocate buffers based on the values read back

Some potential followup:

  • Move rendering to its own thread; this would let our use of Device::poll avoid blocking the main thread.
    • This likely interacts poorly with resizing, if not done carefully.
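To make the intended data flow concrete, here is a minimal sketch assuming a wgpu-style API; the `FrameInFlight` type, field names, and overall structure are illustrative only, not the PR's actual code:

```rust
use wgpu::{Buffer, Device, Maintain, MapMode, Queue, SubmissionIndex};

/// Illustrative record of one submitted frame: its submission index and a
/// MAP_READ staging copy of the GPU bump-allocator counters.
struct FrameInFlight {
    submission: SubmissionIndex,
    bump_readback: Buffer,
}

fn begin_frame(device: &Device, _queue: &Queue, in_flight: &mut Vec<FrameInFlight>) {
    // Before encoding frame N, wait only for frame N-2 and read its counters,
    // so at most two frames are ever in flight.
    if in_flight.len() >= 2 {
        let old = in_flight.remove(0);
        device.poll(Maintain::WaitForSubmissionIndex(old.submission));

        let slice = old.bump_readback.slice(..);
        slice.map_async(MapMode::Read, |_| {});
        device.poll(Maintain::Wait); // drive the mapping to completion
        let counters = slice.get_mapped_range();

        // If a stage reported failure (or near-exhaustion) two frames ago,
        // grow the affected buffers now, before encoding frame N.
        // Decoding `counters` into per-buffer requirements is omitted here.
        let _ = &counters[..];
    }

    // Encode frame N, copy the bump buffer into a fresh MAP_READ staging
    // buffer at the end of the command stream, then:
    //   let submission = queue.submit([encoder.finish()]);
    //   in_flight.push(FrameInFlight { submission, bump_readback });
}
```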

@armansito (Collaborator) left a comment

Thank you for taking a shot at this. I have a few observations:

  • render_to_surface polls the device, which can block. I understand that it only blocks on a submission that is from two frames prior: is this equivalent to triple buffering? I'm guessing this generally doesn't lead to stalls? Is having this logic internal to render_to_surface acceptable to clients or do we want to give them more control over the resizing and blocking behavior?

  • render_to_texture may also benefit from the same resize logic as not all applications are going to draw directly to a surface.

  • The submission tracking (previous and previouser) seems to assume that there is a regular stream of submissions. This is definitely the case when submissions are driven by the display refresh. There could be cases, such as a UI application, that only redraws content occasionally on a state change. In that scenario, an allocation failure may not get detected right away. I'm not sure what to do in that case.

  • Earlier we had talked about giving clients the ability to optionally enable buffer size estimation (which should have a zero failure rate at the cost of CPU time). I think we may want to make resizing optional too. The size estimation currently can't accurately predict buffers like the PTCL, and its best guess will always be very conservative. Perhaps estimation and resizing can be used in combination.

  • You may want to think about when to shrink the buffers since the current logic will monotonically increase the memory footprint.
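On that last point about shrinking, here is a hedged sketch of one possible policy (not from this PR; the thresholds and frame counts are arbitrary placeholders): grow immediately with headroom when a failure is detected, but shrink only after sustained underuse, to avoid oscillation.

```rust
/// Illustrative resize policy with grow-fast / shrink-slow hysteresis.
struct BufferPolicy {
    capacity: u32,          // current allocated element count
    frames_underused: u32,  // consecutive frames using < 25% of capacity
}

impl BufferPolicy {
    /// `required` is the element count reported by the bump-buffer readback.
    fn next_capacity(&mut self, required: u32) -> u32 {
        if required > self.capacity {
            // Grow with headroom so the very next frame is unlikely to fail again.
            self.capacity = (required + required / 2).next_power_of_two();
            self.frames_underused = 0;
        } else if required < self.capacity / 4 {
            self.frames_underused += 1;
            if self.frames_underused > 120 {
                // Only shrink after sustained underuse, and never below a floor.
                self.capacity = (self.capacity / 2).max(1 << 10);
                self.frames_underused = 0;
            }
        } else {
            self.frames_underused = 0;
        }
        self.capacity
    }
}
```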

vello/src/lib.rs (outdated, resolved)
vello/src/lib.rs (resolved)
vello/src/render.rs (outdated, resolved)
vello_encoding/src/config.rs (resolved)
vello_encoding/src/config.rs (outdated, resolved)
let lines = BufferSize::new(1 << 15);
let seg_counts = BufferSize::new(1 << 15);
let segments = BufferSize::new(1 << 15);
let ptcl = BufferSize::new(1 << 17);
@armansito (Collaborator) commented Jun 12, 2024
These sizes are smaller than before. I imagine this is to verify the resizing logic on the larger scenes?

@DJMcNab (Member, Author) replied

Yeah - it's a combination of testing the smaller sizes and also trying to reduce memory usage by default. See e.g. linebender/xilem#319

There is definitely room to change this, as tens of mebibytes of memory usage is probably not worth worrying about; the change was made mostly for testing, however.

vello_shaders/shader/prepare.wgsl (outdated, resolved)
@@ -848,7 +850,7 @@ fn main(
let offset_tangent = offset * normalize(tangent);
let n = offset_tangent.yx * vec2f(-1., 1.);
draw_cap(path_ix, (style_flags & STYLE_FLAGS_START_CAP_MASK) >> 2u,
pts.p0, pts.p0 - n, pts.p0 + n, -offset_tangent, transform);
pts.p0, pts.p0 - n, pts.p0 + n, -offset_tangent, transform);
@armansito (Collaborator) commented
I think these indentations looked better earlier but I'm not strongly attached to that.

@DJMcNab (Member, Author) replied

This wasn't intentional - wgsl-analyzer has a formatter, and I use format-on-save. I have since disabled that formatter for WGSL files, but wasn't aware that this had been changed. I'll restore it.

@@ -41,7 +41,7 @@ fn main(
// We need to check only prior stages, as if this stage has failed in another workgroup,
// we still want to know this workgroup's memory requirement.
if local_id.x == 0u {
-        let failed = (atomicLoad(&bump.failed) & (STAGE_BINNING | STAGE_FLATTEN)) != 0u;
+        let failed = (atomicLoad(&bump.failed) & (STAGE_BINNING | STAGE_FLATTEN | PREVIOUS_RUN)) != 0u;
@armansito (Collaborator) commented Jun 12, 2024

Shouldn't this simply abort if any prior stage failed? I know we have this explicit check in a few places but in practice we probably don't need to mask any of the bits before checking, especially now that prepare zeroes them out at the start.

@DJMcNab (Member, Author) replied

I'm not sure what you mean here; we do "abort" on line 49, but we need to propagate this result to all threads. Is there a specific WGSL abort operation?

Masking out the bits is explained in the pre-existing comment above.

@DJMcNab (Member, Author) commented Jun 12, 2024

Thanks for the review; you picked up on some of the key areas.

> * `render_to_surface` polls the device, which can block. I understand that it only blocks on a submission that is from two frames prior: is this equivalent to triple buffering? I'm guessing this generally doesn't lead to stalls? Is having this logic internal to `render_to_surface` acceptable to clients or do we want to give them more control over the resizing and blocking behavior?
> * `render_to_texture` may also benefit from the same resize logic as not all applications are going to draw directly to a surface.

These decisions were made mostly because the API doesn't have a concept of a "thread" of rendering, i.e. a set of submissions which are related to/follow on from each other. I chose to implement it on render_to_surface as our best proxy for that; if a client is rendering to a surface, they're likely to be driving a continuous pipeline of frames, which is less true for render_to_texture (which may want success no matter how long it takes, but using the same scene).
I expect the end state here will need to involve explicit "client" or "target" management. One example of where that matters is multiple windows; you don't want to block on the first window finishing before submitting to the third window 😆

Yeah, the blocking isn't ideal; my proposed follow-up is to use a separate thread to drive the rendering, probably with a channel accepting a Scene to render and some mechanism for recycling those scenes.
This blocking question also interacts with frame pacing, of course, and I need to be sure that the two can share this essential complexity.
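A rough sketch of that render-thread idea, using std::sync::mpsc; the `Scene` placeholder and the channel layout are assumptions for illustration, not what this PR implements:

```rust
use std::sync::mpsc;
use std::thread;

// Placeholder standing in for the real scene type (e.g. vello::Scene).
struct Scene { /* encoded paths, draw commands, etc. */ }

fn spawn_render_thread() -> (mpsc::Sender<Scene>, mpsc::Receiver<Scene>) {
    // One channel carries scenes to render; a second returns them so the
    // main thread can reuse their allocations instead of reallocating.
    let (to_render, scenes_in) = mpsc::channel::<Scene>();
    let (recycle_out, recycled) = mpsc::channel::<Scene>();

    thread::spawn(move || {
        for scene in scenes_in {
            // Render the scene here; any blocking Device::poll happens on
            // this thread, keeping the main/UI thread responsive.
            // renderer.render_to_surface(&scene, ...);

            // Send the scene back for reuse instead of dropping it.
            if recycle_out.send(scene).is_err() {
                break; // the main thread has shut down
            }
        }
    });

    (to_render, recycled)
}
```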

> * The submission tracking (previous and previouser) seems to assume that there is a regular stream of submissions. This is definitely the case when submissions are driven by the display refresh. There could be cases, such as a UI application, that only redraws content occasionally on a state change. In that scenario, an allocation failure may not get detected right away. I'm not sure what to do in that case.

That is a very good point; I think the best option would be to start blocking on the next submission as soon as submission n-1 has finished. This also probably requires a render thread.

> * Earlier we had talked about giving clients the ability to optionally enable buffer size estimation (which should have a 0 failure rate at the cost of CPU time). I think we may want to make resizing optional too. The size estimation currently can't accurately predict buffers like the PTCL and its best guess will always be very conservative. Perhaps estimation and resize can be used together in some cases.

Yeah, it's a tough one. I think we noted that estimation wasn't too expensive, but that probably needs careful measurement.

I'm going to keep working on this tomorrow, and will think about this more carefully then.

@DJMcNab DJMcNab requested a review from raphlinus July 1, 2024 08:58
@DJMcNab (Member, Author) commented Jul 1, 2024

A clarification on the status of this PR:

  1. The CPU-side API is known to be bad and will need design work.
  2. The important thing this PR contributes is the data flow to/from the GPU in the common case of rendering to a surface.
  3. This PR is not shippable (i.e. it should not be used in a release), but I think it is probably landable, and further improvements need to keep this set of changes in mind.
  4. We're waiting on Raph's review to confirm that.

@DJMcNab DJMcNab marked this pull request as ready for review July 1, 2024 09:00
github-merge-queue bot pushed a commit that referenced this pull request Aug 6, 2024
This brings in support for blend spilling (which was supported in the old piet-gpu).

I don't have a good heuristic for how big to make the buffer. That is something which will need to be addressed in #606 (or its successors). I just guessed that 256 spills would be fine. I think this is probably too small - I suspect we'll get feedback from @TrueDoctor about this.

I have confirmed that the robustness works as expected with the GPU shaders.
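For a rough sense of scale, some back-of-envelope arithmetic; the tile size and pixel format here are assumptions for illustration, not taken from the commit:

```rust
// Back-of-envelope only: assumes 16x16 tiles and 4 bytes (packed RGBA8) per
// pixel for each spilled blend slot. 256 slots is the guess mentioned above.
const SPILL_SLOTS: u64 = 256;
const TILE_PIXELS: u64 = 16 * 16;
const BYTES_PER_PIXEL: u64 = 4;
const BLEND_SPILL_BYTES: u64 = SPILL_SLOTS * TILE_PIXELS * BYTES_PER_PIXEL; // 262,144 B = 256 KiB
```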
Merging this pull request may close: Strategy for robust dynamic memory, readback, and async (#366)