Robust memory allocation handling #606

Open · DJMcNab wants to merge 18 commits into main
Conversation

@DJMcNab (Member) commented Jun 7, 2024

Fixes #366

This is work in progress:

  • Cancel the pipeline if a failure would otherwise be likely to occur
  • Read back the bump buffer (from two frames ago) before starting to render a frame
  • Reallocate buffers based on the values read back

Some potential followup:

  • Move rendering to its own thread; this would let our use of Device::poll avoid blocking the main thread.
    • This likely interacts poorly with resizing, if not done carefully.
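To make the intended data flow concrete, here is a minimal sketch assuming a wgpu-style API; the `FrameInFlight` type, field names, and overall structure are illustrative only, not the PR's actual code:

```rust
use wgpu::{Buffer, Device, Maintain, MapMode, Queue, SubmissionIndex};

/// Illustrative record of one submitted frame: its submission index and a
/// MAP_READ staging copy of the GPU bump-allocator counters.
struct FrameInFlight {
    submission: SubmissionIndex,
    bump_readback: Buffer,
}

fn begin_frame(device: &Device, _queue: &Queue, in_flight: &mut Vec<FrameInFlight>) {
    // Before encoding frame N, wait only for frame N-2 and read its counters,
    // so at most two frames are ever in flight.
    if in_flight.len() >= 2 {
        let old = in_flight.remove(0);
        device.poll(Maintain::WaitForSubmissionIndex(old.submission));

        let slice = old.bump_readback.slice(..);
        slice.map_async(MapMode::Read, |_| {});
        device.poll(Maintain::Wait); // drive the mapping to completion
        let counters = slice.get_mapped_range();

        // If a stage reported failure (or near-exhaustion) two frames ago,
        // grow the affected buffers now, before encoding frame N.
        // Decoding `counters` into per-buffer requirements is omitted here.
        let _ = &counters[..];
    }

    // Encode frame N, copy the bump buffer into a fresh MAP_READ staging
    // buffer at the end of the command stream, then:
    //   let submission = queue.submit([encoder.finish()]);
    //   in_flight.push(FrameInFlight { submission, bump_readback });
}
```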

@armansito (Collaborator) left a comment

Thank you for taking a shot at this. I have a few observations:

  • render_to_surface polls the device, which can block. I understand that it only blocks on a submission that is from two frames prior: is this equivalent to triple buffering? I'm guessing this generally doesn't lead to stalls? Is having this logic internal to render_to_surface acceptable to clients or do we want to give them more control over the resizing and blocking behavior?

  • render_to_texture may also benefit from the same resize logic as not all applications are going to draw directly to a surface.

  • The submission tracking (previous and previouser) seems to assume that there is a regular stream of submissions. This is definitely the case when submissions are driven by the display refresh. There could be cases, such as a UI application, that only redraws content occasionally on a state change. In that scenario, an allocation failure may not get detected right away. I'm not sure what to do in that case.

  • Earlier we had talked about giving clients the ability to optionally enable buffer size estimation (which should have a zero failure rate at the cost of CPU time). I think we may want to make resizing optional too. The size estimation currently can't accurately predict buffers like the PTCL, and its best guess will always be very conservative. Perhaps estimation and resizing can be used in combination.

  • You may want to think about when to shrink the buffers since the current logic will monotonically increase the memory footprint.
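On that last point about shrinking, here is a hedged sketch of one possible policy (not from this PR; the thresholds and frame counts are arbitrary placeholders): grow immediately with headroom when a failure is detected, but shrink only after sustained underuse, to avoid oscillation.

```rust
/// Illustrative resize policy with grow-fast / shrink-slow hysteresis.
struct BufferPolicy {
    capacity: u32,          // current allocated element count
    frames_underused: u32,  // consecutive frames using < 25% of capacity
}

impl BufferPolicy {
    /// `required` is the element count reported by the bump-buffer readback.
    fn next_capacity(&mut self, required: u32) -> u32 {
        if required > self.capacity {
            // Grow with headroom so the very next frame is unlikely to fail again.
            self.capacity = (required + required / 2).next_power_of_two();
            self.frames_underused = 0;
        } else if required < self.capacity / 4 {
            self.frames_underused += 1;
            if self.frames_underused > 120 {
                // Only shrink after sustained underuse, and never below a floor.
                self.capacity = (self.capacity / 2).max(1 << 10);
                self.frames_underused = 0;
            }
        } else {
            self.frames_underused = 0;
        }
        self.capacity
    }
}
```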

vello/src/lib.rs (outdated, resolved)
vello/src/lib.rs (resolved)
vello/src/render.rs (outdated, resolved)
vello_encoding/src/config.rs (resolved)
vello_encoding/src/config.rs (outdated, resolved)
let lines = BufferSize::new(1 << 15);
let seg_counts = BufferSize::new(1 << 15);
let segments = BufferSize::new(1 << 15);
let ptcl = BufferSize::new(1 << 17);
@armansito (Collaborator) commented Jun 12, 2024
These sizes are smaller than before. I imagine this is to verify the resizing logic on the larger scenes?

@DJMcNab (Member, Author) replied

Yeah - it's a combination of testing the smaller sizes and also trying to reduce memory usage by default. See e.g. linebender/xilem#319

There is definitely room to change this, as tens of mebibytes of memory usage is probably not worth worrying about; the change was made mostly for testing, however.

vello_shaders/shader/prepare.wgsl (outdated, resolved)
@@ -848,7 +850,7 @@ fn main(
let offset_tangent = offset * normalize(tangent);
let n = offset_tangent.yx * vec2f(-1., 1.);
draw_cap(path_ix, (style_flags & STYLE_FLAGS_START_CAP_MASK) >> 2u,
pts.p0, pts.p0 - n, pts.p0 + n, -offset_tangent, transform);
pts.p0, pts.p0 - n, pts.p0 + n, -offset_tangent, transform);
@armansito (Collaborator) commented
I think these indentations looked better earlier but I'm not strongly attached to that.

@DJMcNab (Member, Author) replied

This wasn't intentional - wgsl-analyzer has a formatter, and I use format-on-save. I have since disabled that formatter for WGSL files, but wasn't aware that this had been changed. I'll restore it.

@@ -41,7 +41,7 @@ fn main(
// We need to check only prior stages, as if this stage has failed in another workgroup,
// we still want to know this workgroup's memory requirement.
if local_id.x == 0u {
-        let failed = (atomicLoad(&bump.failed) & (STAGE_BINNING | STAGE_FLATTEN)) != 0u;
+        let failed = (atomicLoad(&bump.failed) & (STAGE_BINNING | STAGE_FLATTEN | PREVIOUS_RUN)) != 0u;
@armansito (Collaborator) commented Jun 12, 2024

Shouldn't this simply abort if any prior stage failed? I know we have this explicit check in a few places but in practice we probably don't need to mask any of the bits before checking, especially now that prepare zeroes them out at the start.

@DJMcNab (Member, Author) replied

I'm not sure what you mean here; we do "abort" on line 49, but we need to propagate this result to all threads. Is there a specific WGSL abort operation?

Masking out the bits is explained in the pre-existing comment above.

@DJMcNab (Member, Author) commented Jun 12, 2024

Thanks for the review; you picked up on some of the key areas.

> * `render_to_surface` polls the device, which can block. I understand that it only blocks on a submission that is from two frames prior: is this equivalent to triple buffering? I'm guessing this generally doesn't lead to stalls? Is having this logic internal to `render_to_surface` acceptable to clients or do we want to give them more control over the resizing and blocking behavior?
> * `render_to_texture` may also benefit from the same resize logic as not all applications are going to draw directly to a surface.

These decisions were made mostly because the API doesn't have a concept of a "thread" of rendering, i.e. a set of submissions which are related to/follow on from each other. I chose to implement it on render_to_surface as our best proxy for that; if a client is rendering to a surface, they're likely to be driving a continuous pipeline of frames, which is less true for render_to_texture (which may want success no matter how long it takes, but using the same scene).
I expect the end state here will need to involve explicit "client" or "target" management. One example of where that matters is multiple windows; you don't want to block on the first window finishing before submitting to the third window 😆

Yeah, the blocking isn't ideal; my proposed follow-up is to use a separate thread to drive the rendering, probably with a channel accepting a Scene to render and some mechanism for recycling those scenes.
This blocking question also interacts with frame pacing, of course, and I need to be sure that the two can share this essential complexity.
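A rough sketch of that render-thread idea, using std::sync::mpsc; the `Scene` placeholder and the channel layout are assumptions for illustration, not what this PR implements:

```rust
use std::sync::mpsc;
use std::thread;

// Placeholder standing in for the real scene type (e.g. vello::Scene).
struct Scene { /* encoded paths, draw commands, etc. */ }

fn spawn_render_thread() -> (mpsc::Sender<Scene>, mpsc::Receiver<Scene>) {
    // One channel carries scenes to render; a second returns them so the
    // main thread can reuse their allocations instead of reallocating.
    let (to_render, scenes_in) = mpsc::channel::<Scene>();
    let (recycle_out, recycled) = mpsc::channel::<Scene>();

    thread::spawn(move || {
        for scene in scenes_in {
            // Render the scene here; any blocking Device::poll happens on
            // this thread, keeping the main/UI thread responsive.
            // renderer.render_to_surface(&scene, ...);

            // Send the scene back for reuse instead of dropping it.
            if recycle_out.send(scene).is_err() {
                break; // the main thread has shut down
            }
        }
    });

    (to_render, recycled)
}
```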

> * The submission tracking (previous and previouser) seems to assume that there is a regular stream of submissions. This is definitely the case when submissions are driven by the display refresh. There could be cases, such as a UI application, that only redraws content occasionally on a state change. In that scenario, an allocation failure may not get detected right away. I'm not sure what to do in that case.

That is a very good point; I think the best option would be to start blocking on the next submission as soon as submission n-1 has finished. This also probably requires a render thread.

> * Earlier we had talked about giving clients the ability to optionally enable buffer size estimation (which should have a 0 failure rate at the cost of CPU time). I think we may want to make resizing optional too. The size estimation currently can't accurately predict buffers like the PTCL and its best guess will always be very conservative. Perhaps estimation and resize can be used together in some cases.

Yeah, it's a tough one. I think we noted that estimation wasn't too expensive, but that probably needs careful measurement.

I'm going to keep working on this tomorrow, and will think about this more carefully then.

@DJMcNab DJMcNab requested a review from raphlinus July 1, 2024 08:58
@DJMcNab (Member, Author) commented Jul 1, 2024

A clarification on the status of this PR:

  1. The CPU-side API is known to be bad and will need design work.
  2. The important thing this PR contributes is the data flow to/from the GPU in the common case of rendering to a surface.
  3. This PR is not shippable (i.e. it should not be used in a release), but I think it is probably landable, and further improvements need to keep this set of changes in mind.
  4. We're waiting on Raph's review to confirm that.

@DJMcNab DJMcNab marked this pull request as ready for review July 1, 2024 09:00
github-merge-queue bot pushed a commit that referenced this pull request Aug 6, 2024
This brings in support for blend spilling (which was supported in the old piet-gpu).

I don't have a good heuristic for how big to make the buffer. That is something which will need to be addressed in #606 (or its successors). I just guessed that 256 spills would be fine. I think this is probably too small - I suspect we'll get feedback from @TrueDoctor about this.

I have confirmed that the robustness works as expected with the GPU shaders.
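For a rough sense of scale, some back-of-envelope arithmetic; the tile size and pixel format here are assumptions for illustration, not taken from the commit:

```rust
// Back-of-envelope only: assumes 16x16 tiles and 4 bytes (packed RGBA8) per
// pixel for each spilled blend slot. 256 slots is the guess mentioned above.
const SPILL_SLOTS: u64 = 256;
const TILE_PIXELS: u64 = 16 * 16;
const BYTES_PER_PIXEL: u64 = 4;
const BLEND_SPILL_BYTES: u64 = SPILL_SLOTS * TILE_PIXELS * BYTES_PER_PIXEL; // 262,144 B = 256 KiB
```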
Merging this pull request may close: Strategy for robust dynamic memory, readback, and async (#366)