Robust memory allocation handling #606
base: main
Conversation
Thank you for taking a shot at this. I have a few observations:
- `render_to_surface` polls the device, which can block. I understand that it only blocks on a submission that is from two frames prior: is this equivalent to triple buffering? I'm guessing this generally doesn't lead to stalls? Is having this logic internal to `render_to_surface` acceptable to clients, or do we want to give them more control over the resizing and blocking behavior? (See the sketch after this list.)
- `render_to_texture` may also benefit from the same resize logic, as not all applications are going to draw directly to a surface.
- The submission tracking (previous and previouser) seems to assume that there is a regular stream of submissions. This is definitely the case when submissions are driven by the display refresh. There could be cases, such as a UI application that only redraws content occasionally on a state change. In that scenario, an allocation failure may not get detected right away. I'm not sure what to do in that case.
- Earlier we had talked about giving clients the ability to optionally enable buffer size estimation (which should have a 0 failure rate at the cost of CPU time). I think we may want to make resizing optional too. The size estimation currently can't accurately predict buffers like the PTCL, and its best guess will always be very conservative. Perhaps estimation and resize can be used in combination.
- You may want to think about when to shrink the buffers, since the current logic will monotonically increase the memory footprint.
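To illustrate the blocking pattern in question, here is a minimal sketch, assuming wgpu's `SubmissionIndex` and `Maintain::WaitForSubmissionIndex`; the `SubmissionHistory` type and `record_and_wait` are invented for illustration and are not the PR's actual code:

```rust
/// Tracks the last two submissions ("previous" and "previouser") so that a
/// new frame only waits on work submitted two frames ago, which behaves
/// roughly like triple buffering.
#[derive(Default)]
struct SubmissionHistory {
    previous: Option<wgpu::SubmissionIndex>,
    previouser: Option<wgpu::SubmissionIndex>,
}

impl SubmissionHistory {
    /// Record this frame's submission and block on the one from two frames
    /// prior, if it has not already completed.
    fn record_and_wait(&mut self, device: &wgpu::Device, new: wgpu::SubmissionIndex) {
        if let Some(old) = self.previouser.take() {
            // Only blocks when the GPU has fallen more than two frames behind.
            device.poll(wgpu::Maintain::WaitForSubmissionIndex(old));
        }
        self.previouser = self.previous.take();
        self.previous = Some(new);
    }
}
```

Whether that logic should live inside `render_to_surface` or be exposed to callers is exactly the open question above.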
vello_encoding/src/config.rs
Outdated
    let lines = BufferSize::new(1 << 15);
    let seg_counts = BufferSize::new(1 << 15);
    let segments = BufferSize::new(1 << 15);
    let ptcl = BufferSize::new(1 << 17);
These sizes are smaller than before. I imagine this is to verify the resizing logic on the larger scenes?
Yeah - it's a combination of testing the smaller sizes and also trying to reduce memory usage by default. See e.g. linebender/xilem#319
There is definitely room to change this, as tens of mebibytes of memory usage is probably not worth worrying about; the change was mostly made for testing, however.
@@ -848,7 +850,7 @@ fn main(
    let offset_tangent = offset * normalize(tangent);
    let n = offset_tangent.yx * vec2f(-1., 1.);
    draw_cap(path_ix, (style_flags & STYLE_FLAGS_START_CAP_MASK) >> 2u,
-       pts.p0, pts.p0 - n, pts.p0 + n, -offset_tangent, transform);
+       pts.p0, pts.p0 - n, pts.p0 + n, -offset_tangent, transform);
I think these indentations looked better earlier but I'm not strongly attached to that.
This wasn't intentional - wgsl-analyzer has a formatter, and I use format on save. I have since disabled that formatter for wgsl files, but wasn't aware that this had been changed. I'll restore it.
@@ -41,7 +41,7 @@ fn main(
    // We need to check only prior stages, as if this stage has failed in another workgroup,
    // we still want to know this workgroup's memory requirement.
    if local_id.x == 0u {
-       let failed = (atomicLoad(&bump.failed) & (STAGE_BINNING | STAGE_FLATTEN)) != 0u;
+       let failed = (atomicLoad(&bump.failed) & (STAGE_BINNING | STAGE_FLATTEN | PREVIOUS_RUN)) != 0u;
Shouldn't this simply abort if any prior stage failed? I know we have this explicit check in a few places, but in practice we probably don't need to mask any of the bits before checking, especially now that `prepare` zeroes them out at the start.
I'm not sure what you mean here; we do "abort" on line 49, but we need to get this to all threads? Is there a specific wgsl `abort` operation?
Masking out the bits is explained in the pre-existing comment, above.
Thanks for the review; luckily you picked up on some of the areas
These decisions were made mostly because the API doesn't have a concept of a "thread" of rendering, such as a set of submissions which are related to/follow from each other. I chose to implement it on …

Yeah, the blocking isn't ideal; the follow-up work I mentioned involves using a separate thread to drive the rendering, i.e. probably a channel accepting a …
That is a very good point; I think the best option would be to start blocking on the next submission as soon as submission n-1 has finished. This also probably requires a render thread.
Yeah, it's a tough one. I think we noted that estimation wasn't too expensive, but that probably needs careful measurement. I'm going to keep working on this tomorrow, and will think about this more carefully then.
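For concreteness, a hypothetical sketch of that render-thread idea, using a std::sync::mpsc channel; `RenderRequest` and `spawn_render_thread` are invented names and nothing here is part of this PR:

```rust
use std::sync::mpsc;
use std::thread;

/// Messages the application sends to the render thread. The variants and
/// their (elided) payloads are placeholders.
enum RenderRequest {
    /// Render one frame; a real version would carry the scene and target.
    Frame,
    Shutdown,
}

/// Spawn a thread that owns the submission tracking, blocking, and
/// resize-on-failure logic, so the main/UI thread never waits on the GPU.
fn spawn_render_thread() -> mpsc::Sender<RenderRequest> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for request in rx {
            match request {
                RenderRequest::Frame => {
                    // Encode and submit the frame here; block on an older
                    // submission (e.g. from two frames prior, or on n-1 as
                    // discussed above) and grow buffers if the previous run
                    // reported an allocation failure.
                }
                RenderRequest::Shutdown => break,
            }
        }
    });
    tx
}
```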
A clarification on the status of this PR:
This brings in support for blend spilling (which was supported in the old piet-gpu). I don't have a good heuristic for how big to make the buffer. That is something which will need to be addressed in #606 (or its successors). I just guessed that 256 spills would be fine. I think this is probably too small - I suspect we'll get feedback from @TrueDoctor about this. I have confirmed that the robustness works as expected with the GPU shaders.
`rust-analyzer` will eagerly eat the error output otherwise
This should slightly reduce the amount of work required
Extracted from linebender#606
Fixes #366
This is work in progress:
Some potential follow-up:
- `Device::poll` to not block the main thread (see the sketch below).
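As a rough illustration of that last item (an assumption about one possible approach, not what this PR implements), the blocking wait could be handed to a helper thread, again assuming wgpu's `Maintain::WaitForSubmissionIndex`:

```rust
use std::sync::{mpsc, Arc};
use std::thread;

/// Spawn a thread that performs the blocking waits. The main thread sends
/// each `SubmissionIndex` it wants completed and never calls `poll` itself.
fn spawn_poll_thread(device: Arc<wgpu::Device>) -> mpsc::Sender<wgpu::SubmissionIndex> {
    let (tx, rx) = mpsc::channel::<wgpu::SubmissionIndex>();
    thread::spawn(move || {
        for index in rx {
            // Blocks this helper thread (not the main thread) until the
            // given submission has finished on the GPU.
            device.poll(wgpu::Maintain::WaitForSubmissionIndex(index));
        }
    });
    tx
}
```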