-
Notifications
You must be signed in to change notification settings - Fork 928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Surface#configure and Surface#get_current_texture non-fatal #6253
base: trunk
Are you sure you want to change the base?
Make Surface#configure and Surface#get_current_texture non-fatal #6253
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great overall, definitely an improvement over status quo!
Comments are mostly cosmetic and discussion of future improvements.
Err(err) => self.handle_error_fatal(err, "Surface::get_current_texture_view"), | ||
Err(err) => match surface_data.error_sink.lock().as_ref() { | ||
Some(error_sink) => { | ||
self.handle_error_nolabel(error_sink, err, "Surface::get_current_texture_view"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish we could forward this error as-is. We might add more status codes for this in the future or enhance the state that is Unknown
right now to carry more information 🤔. E.g. "not configured" is an error but not a status right now which seems strange to me.
What irks me in particular about this is that according to your reports you are seeing spurious errors here with many of your users which implies that going to the error sink is the wrong thing to do: a reported error should imply application-implementor error rather than driver/end-user error. Meaning it's something that one would be expected to dynamically handle as an application-implementor. Note also that handle_error_*
will cause a panic unless a custom error handler is installed.
... or are the errors you're seeing with your users exclusively about configure
?
To clarify, we don't have to solve this here all the way. This is more of a sentiment / open question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The crash we're seeing right now is that we occasionally seem to get "Parent Device is Lost" validation errors when calling Surface#configure
and Surface#get_current_texture
. This only seems to be an issue with NVIDIA devices.
I've attached an example stacktrace of that below:
The most ergonomic solution to this (IMO, obviously you know this code better than I do) would be to forward the error and return a result in both functions so the caller can handle them. As you mentioned in the issue when we were discussing earlier, we need to conform with the webgpu spec around handling errors asynchronously so I'm not sure that's totally feasible.
What I'm find awkward about the current approach (and I think this is partly to your point) is that the get_current_texture
already returns a Result
. Which means in the case we return SurfaceError::Unknown
the function returns an error but you really need to go use the error callbacks to get the full details on the error to determine how to handle it.
What do you think is the next step here? I think what is in this PR is ok (and acceptable) but would love to brainstorm how to make it more ergonomic. We can't return Ok
(because we don't have a SurfaceTexture
) but we could return an actual Err
with the details of what happened (instead of just SurfaceError::Unknown
). I'm not sure that's that much better though since we'd duplicate the error across the return type and the error callbacks...Is there prior art for error handling like this in wgpu that you know of? Would love suggestions on this :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for sharing those callstacks. This kind of tracking is incredibly valuable. We do some of that at Rerun as well but our userbase is a lot smaller at this point and a lot of the traffic is blocked (and it's easy to opt out), we don't have that good of an insight.
The most ergonomic solution to this (IMO, obviously you know this code better than I do) would be to forward the error and return a result in both functions so the caller can handle them. As you mentioned in the issue when we were discussing earlier, we need to conform with the webgpu spec around handling errors asynchronously so I'm not sure that's totally feasible.
100% agree the more we can forward directly and still be WebGPU compatible the better!
Seeing the picture a bit better by now, I think I actually have to row back a bit on my statements about that we can't return Results at all:
- whenever webgpu spec says that something throws on the content timeline (the green marked stuff), we can catch those failures and convert them to a Result
- the "queue up the error" case is for every failure on the device timeline and that is indeed most of the validation and why unfortunately a ton of things can't be Results and instead are asynchronously reported errors
- whenever a failure (or better, a family of errors) occurs that can only happen with wgpu-core and never with the js webgpu api, then we can return a result
get_current_texture in particular has practically no failure case other than some resources being null which afaik isn't possible in our Rust interface.
What I'm find awkward about the current approach (and I think this is partly to your point) is that the get_current_texture already returns a Result
Yes agree again. With the above in mind about what failures are possible in wgpu-core vs webgpu this makes a bit more sense.
I think what is in this PR is ok (and acceptable)
Yep, definitely an improvement. Let's not get perfect in the way of good 😄
We can't return Ok (because we don't have a SurfaceTexture) but we could return an actual Err with the details of what happened (instead of just SurfaceError::Unknown). I'm not sure that's that much better though since we'd duplicate the error across the return type and the error callbacks...Is there prior art for error handling like this in wgpu that you know of? Would love suggestions on this :D
I think we should go over all surface relateed results & error codes again across the crate and unify them as much as possible. I haven't thought it through yet, but I think we could have status codes only show up as part of usable textures and everything else be an error with no status code at all.
Right now there's a lot of weird transformation going on between the wgpu interface, the wgpu-core context (still inside the wgpu crate) and the wgpu-core crate which has the actual implementation of the surfaces. Ideally we just pass things through as-is.
Not a lot of prior art to this unfortunately. There should probably more places in wgpu where we give back some limited Result - grepping for throw
in the webgpu spec to see where the content timeline can fail yields a handful of results and they should probably all translate into Result
. Very few do today, buffer mapping has some iirc.
/// The surface status is not known. | ||
Unknown, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if Unknown
is really the right term here. We know that get_current_texture
failed for a known reason when we get this. But then again we don't know what state our surface is in at this point (the error may have caused mayhem such that reconfigure is in order or the next get_current_texture
call might just succeed).
So maybe it's enough to just clarify this here a little bit:
/// The surface status is not known. | |
Unknown, | |
/// The surface status is not known since `get_current_texture` previously failed. | |
Unknown, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of clarifying via the comment. The reason I chose Unknown
as the name here is exactly what you mentioned: we don't really know the status of the surface when we get this error.
For example, if we get a DeviceLost
error we can't reliably report anything about the status of the Surface (since we got an error higher in the stack)
/// Acquiring a texture failed for an unknown reason | ||
Other, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar'ish suggestion SurfaceStatus::Unknown
: It's not really an unknown reason, we just haven't passed it through (yet?) 🤔
/// Acquiring a texture failed for an unknown reason | |
Other, | |
/// Acquiring a texture failed with a generic error. Check error callbacks for more information. | |
Other, |
string below needs adjusting as well.
Co-authored-by: Andreas Reich <[email protected]>
Thanks for the fast review! Will defer the remaining comment until #6253 (comment) is addressed, since I think that it impacts how it should be resolved. |
Connections
Link to the issues addressed by this PR, or dependent PRs in other repositories
#3586
Description
This PR attempts to fix API shortcomings in wgpu where
Surface#configure
andSurface#get_current_texture
return fatal errors.There are two primary changes:
For
Surface#configure
, usehandle_error_nolabel
instead ofhandle_error_fatal
. Warp (tens of thousands of daily users) has already made this change in a fork and has not had any issues.For
Surface#get_current_texture
change the call tohandle_error_fatal
to also usehandle_error_nolabel
. This function is a little trickier to support because we don't easily have access to anErrorSink
and we need to figure out what to return in the case of an error. For the former, I decided to store an optionalErrorSink
on theSurface
that is only set upon device configuration. I chose to do this so that wgpu consumers can actually use the device push/pop error scope functions to handle the error. I also decided to add newSurfaceStatus::Unknown
andSurfaceError::Other
variants to denote cases where we weren't able to derive the state of the surface due to an unknown.Assuming this approach generally looks good, I can address comments and add a changelog entry. Thank you for all of your hard work maintaining this repo!
Testing
I forced both of these functions to always return an error and confirmed that I'm now able to properly handle them as a consumer of this library.
Checklist
cargo fmt
.cargo clippy
. If applicable, add:--target wasm32-unknown-unknown
--target wasm32-unknown-emscripten
cargo xtask test
to run tests.CHANGELOG.md
. See simple instructions inside file.