Make Surface#configure and Surface#get_current_texture non-fatal #6253

alokedesai · 2024-09-11T01:02:45Z

Connections
Link to the issues addressed by this PR, or dependent PRs in other repositories
#3586

Description
This PR attempts to fix API shortcomings in wgpu where Surface#configure and Surface#get_current_texture return fatal errors.

There are two primary changes:

For Surface#configure, use handle_error_nolabel instead of handle_error_fatal. Warp (tens of thousands of daily users) has already made this change in a fork and has not had any issues.
For Surface#get_current_texture change the call to handle_error_fatal to also use handle_error_nolabel. This function is a little trickier to support because we don't easily have access to an ErrorSink and we need to figure out what to return in the case of an error. For the former, I decided to store an optional ErrorSink on the Surface that is only set upon device configuration. I chose to do this so that wgpu consumers can actually use the device push/pop error scope functions to handle the error. I also decided to add new SurfaceStatus::Unknown and SurfaceError::Other variants to denote cases where we weren't able to derive the state of the surface due to an unknown.

Assuming this approach generally looks good, I can address comments and add a changelog entry. Thank you for all of your hard work maintaining this repo!

Testing
I forced both of these functions to always return an error and confirmed that I'm now able to properly handle them as a consumer of this library.

Checklist

Run cargo fmt.
Run cargo clippy. If applicable, add:
- --target wasm32-unknown-unknown
- --target wasm32-unknown-emscripten
Run cargo xtask test to run tests.
Add change to CHANGELOG.md. See simple instructions inside file.

Wumpf

Looking great overall, definitely an improvement over status quo!
Comments are mostly cosmetic and discussion of future improvements.

wgpu/src/api/surface_texture.rs

Wumpf · 2024-09-11T16:42:38Z

wgpu/src/backend/wgpu_core.rs

-            Err(err) => self.handle_error_fatal(err, "Surface::get_current_texture_view"),
+            Err(err) => match surface_data.error_sink.lock().as_ref() {
+                Some(error_sink) => {
+                    self.handle_error_nolabel(error_sink, err, "Surface::get_current_texture_view");


I wish we could forward this error as-is. We might add more status codes for this in the future or enhance the state that is Unknown right now to carry more information 🤔. E.g. "not configured" is an error but not a status right now which seems strange to me.

What irks me in particular about this is that according to your reports you are seeing spurious errors here with many of your users which implies that going to the error sink is the wrong thing to do: a reported error should imply application-implementor error rather than driver/end-user error. Meaning it's something that one would be expected to dynamically handle as an application-implementor. Note also that handle_error_* will cause a panic unless a custom error handler is installed.

... or are the errors you're seeing with your users exclusively about configure?

To clarify, we don't have to solve this here all the way. This is more of a sentiment / open question.

The crash we're seeing right now is that we occasionally seem to get "Parent Device is Lost" validation errors when calling Surface#configure and Surface#get_current_texture. This only seems to be an issue with NVIDIA devices.

I've attached an example stacktrace of that below:

The most ergonomic solution to this (IMO, obviously you know this code better than I do) would be to forward the error and return a result in both functions so the caller can handle them. As you mentioned in the issue when we were discussing earlier, we need to conform with the webgpu spec around handling errors asynchronously so I'm not sure that's totally feasible.

What I'm find awkward about the current approach (and I think this is partly to your point) is that the get_current_texture already returns a Result. Which means in the case we return SurfaceError::Unknown the function returns an error but you really need to go use the error callbacks to get the full details on the error to determine how to handle it.

What do you think is the next step here? I think what is in this PR is ok (and acceptable) but would love to brainstorm how to make it more ergonomic. We can't return Ok (because we don't have a SurfaceTexture) but we could return an actual Err with the details of what happened (instead of just SurfaceError::Unknown). I'm not sure that's that much better though since we'd duplicate the error across the return type and the error callbacks...Is there prior art for error handling like this in wgpu that you know of? Would love suggestions on this :D

thanks for sharing those callstacks. This kind of tracking is incredibly valuable. We do some of that at Rerun as well but our userbase is a lot smaller at this point and a lot of the traffic is blocked (and it's easy to opt out), we don't have that good of an insight.

The most ergonomic solution to this (IMO, obviously you know this code better than I do) would be to forward the error and return a result in both functions so the caller can handle them. As you mentioned in the issue when we were discussing earlier, we need to conform with the webgpu spec around handling errors asynchronously so I'm not sure that's totally feasible.

100% agree the more we can forward directly and still be WebGPU compatible the better!
Seeing the picture a bit better by now, I think I actually have to row back a bit on my statements about that we can't return Results at all:

whenever webgpu spec says that something throws on the content timeline (the green marked stuff), we can catch those failures and convert them to a Result

the "queue up the error" case is for every failure on the device timeline and that is indeed most of the validation and why unfortunately a ton of things can't be Results and instead are asynchronously reported errors

whenever a failure (or better, a family of errors) occurs that can only happen with wgpu-core and never with the js webgpu api, then we can return a result

get_current_texture in particular has practically no failure case other than some resources being null which afaik isn't possible in our Rust interface.

What I'm find awkward about the current approach (and I think this is partly to your point) is that the get_current_texture already returns a Result

Yes agree again. With the above in mind about what failures are possible in wgpu-core vs webgpu this makes a bit more sense.

I think what is in this PR is ok (and acceptable)

Yep, definitely an improvement. Let's not get perfect in the way of good 😄

We can't return Ok (because we don't have a SurfaceTexture) but we could return an actual Err with the details of what happened (instead of just SurfaceError::Unknown). I'm not sure that's that much better though since we'd duplicate the error across the return type and the error callbacks...Is there prior art for error handling like this in wgpu that you know of? Would love suggestions on this :D

I think we should go over all surface relateed results & error codes again across the crate and unify them as much as possible. I haven't thought it through yet, but I think we could have status codes only show up as part of usable textures and everything else be an error with no status code at all.
Right now there's a lot of weird transformation going on between the wgpu interface, the wgpu-core context (still inside the wgpu crate) and the wgpu-core crate which has the actual implementation of the surfaces. Ideally we just pass things through as-is.

Not a lot of prior art to this unfortunately. There should probably more places in wgpu where we give back some limited Result - grepping for throw in the webgpu spec to see where the content timeline can fail yields a handful of results and they should probably all translate into Result. Very few do today, buffer mapping has some iirc.

Wumpf · 2024-09-11T16:49:53Z

wgpu-types/src/lib.rs

+    /// The surface status is not known.
+    Unknown,


I'm wondering if Unknown is really the right term here. We know that get_current_texture failed for a known reason when we get this. But then again we don't know what state our surface is in at this point (the error may have caused mayhem such that reconfigure is in order or the next get_current_texture call might just succeed).

So maybe it's enough to just clarify this here a little bit:

Suggested change

/// The surface status is not known.

Unknown,

/// The surface status is not known since `get_current_texture` previously failed.

Unknown,

I like the idea of clarifying via the comment. The reason I chose Unknown as the name here is exactly what you mentioned: we don't really know the status of the surface when we get this error.

For example, if we get a DeviceLost error we can't reliably report anything about the status of the Surface (since we got an error higher in the stack)

Wumpf · 2024-09-11T17:00:11Z

wgpu/src/api/surface_texture.rs

+    /// Acquiring a texture failed for an unknown reason
+    Other,


Similar'ish suggestion SurfaceStatus::Unknown: It's not really an unknown reason, we just haven't passed it through (yet?) 🤔

Suggested change

/// Acquiring a texture failed for an unknown reason

Other,

/// Acquiring a texture failed with a generic error. Check error callbacks for more information.

Other,

string below needs adjusting as well.

Co-authored-by: Andreas Reich <[email protected]>

alokedesai · 2024-09-11T17:35:15Z

Thanks for the fast review! Will defer the remaining comment until #6253 (comment) is addressed, since I think that it impacts how it should be resolved.

alokedesai added 2 commits September 10, 2024 20:30

Make Surface#configure non-fatal

a0f430d

Make Surface#get_current_texture non-fatal

fb5c957

alokedesai requested a review from a team as a code owner September 11, 2024 01:02

Cargo clippy

c321520

Wumpf self-requested a review September 11, 2024 08:44

Wumpf linked an issue Sep 11, 2024 that may be closed by this pull request

Surface Configuration Failure is Fatal #3586

Open

Wumpf requested changes Sep 11, 2024

View reviewed changes

Update wgpu/src/api/surface_texture.rs

ea67d72

Co-authored-by: Andreas Reich <[email protected]>

Wumpf self-requested a review September 12, 2024 07:21

Wumpf mentioned this pull request Sep 15, 2024

Generic "parent device lost" error/crash after #6092 #6279

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Surface#configure and Surface#get_current_texture non-fatal #6253

Make Surface#configure and Surface#get_current_texture non-fatal #6253

alokedesai commented Sep 11, 2024 •

edited

Loading

Wumpf left a comment

Wumpf Sep 11, 2024

alokedesai Sep 11, 2024

Wumpf Sep 13, 2024

Wumpf Sep 11, 2024

alokedesai Sep 11, 2024

Wumpf Sep 11, 2024

alokedesai commented Sep 11, 2024

Make Surface#configure and Surface#get_current_texture non-fatal #6253

Are you sure you want to change the base?

Make Surface#configure and Surface#get_current_texture non-fatal #6253

Conversation

alokedesai commented Sep 11, 2024 • edited Loading

Wumpf left a comment

Choose a reason for hiding this comment

Wumpf Sep 11, 2024

Choose a reason for hiding this comment

alokedesai Sep 11, 2024

Choose a reason for hiding this comment

Wumpf Sep 13, 2024

Choose a reason for hiding this comment

Wumpf Sep 11, 2024

Choose a reason for hiding this comment

alokedesai Sep 11, 2024

Choose a reason for hiding this comment

Wumpf Sep 11, 2024

Choose a reason for hiding this comment

alokedesai commented Sep 11, 2024

alokedesai commented Sep 11, 2024 •

edited

Loading