Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Surface#configure and Surface#get_current_texture non-fatal #6253

Open
wants to merge 4 commits into
base: trunk
Choose a base branch
from

Conversation

alokedesai
Copy link

@alokedesai alokedesai commented Sep 11, 2024

Connections
Link to the issues addressed by this PR, or dependent PRs in other repositories
#3586

Description
This PR attempts to fix API shortcomings in wgpu where Surface#configure and Surface#get_current_texture return fatal errors.

There are two primary changes:

  1. For Surface#configure, use handle_error_nolabel instead of handle_error_fatal. Warp (tens of thousands of daily users) has already made this change in a fork and has not had any issues.

  2. For Surface#get_current_texture change the call to handle_error_fatal to also use handle_error_nolabel. This function is a little trickier to support because we don't easily have access to an ErrorSink and we need to figure out what to return in the case of an error. For the former, I decided to store an optional ErrorSink on the Surface that is only set upon device configuration. I chose to do this so that wgpu consumers can actually use the device push/pop error scope functions to handle the error. I also decided to add new SurfaceStatus::Unknown and SurfaceError::Other variants to denote cases where we weren't able to derive the state of the surface due to an unknown.

Assuming this approach generally looks good, I can address comments and add a changelog entry. Thank you for all of your hard work maintaining this repo!

Testing
I forced both of these functions to always return an error and confirmed that I'm now able to properly handle them as a consumer of this library.

Checklist

  • Run cargo fmt.
  • Run cargo clippy. If applicable, add:
    • --target wasm32-unknown-unknown
    • --target wasm32-unknown-emscripten
  • Run cargo xtask test to run tests.
  • Add change to CHANGELOG.md. See simple instructions inside file.

@alokedesai alokedesai requested a review from a team as a code owner September 11, 2024 01:02
@Wumpf Wumpf self-requested a review September 11, 2024 08:44
@Wumpf Wumpf linked an issue Sep 11, 2024 that may be closed by this pull request
Copy link
Member

@Wumpf Wumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great overall, definitely an improvement over status quo!
Comments are mostly cosmetic and discussion of future improvements.

wgpu/src/api/surface_texture.rs Outdated Show resolved Hide resolved
Err(err) => self.handle_error_fatal(err, "Surface::get_current_texture_view"),
Err(err) => match surface_data.error_sink.lock().as_ref() {
Some(error_sink) => {
self.handle_error_nolabel(error_sink, err, "Surface::get_current_texture_view");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we could forward this error as-is. We might add more status codes for this in the future or enhance the state that is Unknown right now to carry more information 🤔. E.g. "not configured" is an error but not a status right now which seems strange to me.

What irks me in particular about this is that according to your reports you are seeing spurious errors here with many of your users which implies that going to the error sink is the wrong thing to do: a reported error should imply application-implementor error rather than driver/end-user error. Meaning it's something that one would be expected to dynamically handle as an application-implementor. Note also that handle_error_* will cause a panic unless a custom error handler is installed.

... or are the errors you're seeing with your users exclusively about configure?

To clarify, we don't have to solve this here all the way. This is more of a sentiment / open question.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crash we're seeing right now is that we occasionally seem to get "Parent Device is Lost" validation errors when calling Surface#configure and Surface#get_current_texture. This only seems to be an issue with NVIDIA devices.

I've attached an example stacktrace of that below:
image

The most ergonomic solution to this (IMO, obviously you know this code better than I do) would be to forward the error and return a result in both functions so the caller can handle them. As you mentioned in the issue when we were discussing earlier, we need to conform with the webgpu spec around handling errors asynchronously so I'm not sure that's totally feasible.

What I'm find awkward about the current approach (and I think this is partly to your point) is that the get_current_texture already returns a Result. Which means in the case we return SurfaceError::Unknown the function returns an error but you really need to go use the error callbacks to get the full details on the error to determine how to handle it.

What do you think is the next step here? I think what is in this PR is ok (and acceptable) but would love to brainstorm how to make it more ergonomic. We can't return Ok (because we don't have a SurfaceTexture) but we could return an actual Err with the details of what happened (instead of just SurfaceError::Unknown). I'm not sure that's that much better though since we'd duplicate the error across the return type and the error callbacks...Is there prior art for error handling like this in wgpu that you know of? Would love suggestions on this :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for sharing those callstacks. This kind of tracking is incredibly valuable. We do some of that at Rerun as well but our userbase is a lot smaller at this point and a lot of the traffic is blocked (and it's easy to opt out), we don't have that good of an insight.

The most ergonomic solution to this (IMO, obviously you know this code better than I do) would be to forward the error and return a result in both functions so the caller can handle them. As you mentioned in the issue when we were discussing earlier, we need to conform with the webgpu spec around handling errors asynchronously so I'm not sure that's totally feasible.

100% agree the more we can forward directly and still be WebGPU compatible the better!
Seeing the picture a bit better by now, I think I actually have to row back a bit on my statements about that we can't return Results at all:

  • whenever webgpu spec says that something throws on the content timeline (the green marked stuff), we can catch those failures and convert them to a Result
    • the "queue up the error" case is for every failure on the device timeline and that is indeed most of the validation and why unfortunately a ton of things can't be Results and instead are asynchronously reported errors
  • whenever a failure (or better, a family of errors) occurs that can only happen with wgpu-core and never with the js webgpu api, then we can return a result

get_current_texture in particular has practically no failure case other than some resources being null which afaik isn't possible in our Rust interface.

What I'm find awkward about the current approach (and I think this is partly to your point) is that the get_current_texture already returns a Result

Yes agree again. With the above in mind about what failures are possible in wgpu-core vs webgpu this makes a bit more sense.

I think what is in this PR is ok (and acceptable)

Yep, definitely an improvement. Let's not get perfect in the way of good 😄

We can't return Ok (because we don't have a SurfaceTexture) but we could return an actual Err with the details of what happened (instead of just SurfaceError::Unknown). I'm not sure that's that much better though since we'd duplicate the error across the return type and the error callbacks...Is there prior art for error handling like this in wgpu that you know of? Would love suggestions on this :D

I think we should go over all surface relateed results & error codes again across the crate and unify them as much as possible. I haven't thought it through yet, but I think we could have status codes only show up as part of usable textures and everything else be an error with no status code at all.
Right now there's a lot of weird transformation going on between the wgpu interface, the wgpu-core context (still inside the wgpu crate) and the wgpu-core crate which has the actual implementation of the surfaces. Ideally we just pass things through as-is.

Not a lot of prior art to this unfortunately. There should probably more places in wgpu where we give back some limited Result - grepping for throw in the webgpu spec to see where the content timeline can fail yields a handful of results and they should probably all translate into Result. Very few do today, buffer mapping has some iirc.

Comment on lines +5629 to +5630
/// The surface status is not known.
Unknown,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if Unknown is really the right term here. We know that get_current_texture failed for a known reason when we get this. But then again we don't know what state our surface is in at this point (the error may have caused mayhem such that reconfigure is in order or the next get_current_texture call might just succeed).

So maybe it's enough to just clarify this here a little bit:

Suggested change
/// The surface status is not known.
Unknown,
/// The surface status is not known since `get_current_texture` previously failed.
Unknown,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of clarifying via the comment. The reason I chose Unknown as the name here is exactly what you mentioned: we don't really know the status of the surface when we get this error.

For example, if we get a DeviceLost error we can't reliably report anything about the status of the Surface (since we got an error higher in the stack)

Comment on lines +70 to +71
/// Acquiring a texture failed for an unknown reason
Other,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar'ish suggestion SurfaceStatus::Unknown: It's not really an unknown reason, we just haven't passed it through (yet?) 🤔

Suggested change
/// Acquiring a texture failed for an unknown reason
Other,
/// Acquiring a texture failed with a generic error. Check error callbacks for more information.
Other,

string below needs adjusting as well.

@alokedesai
Copy link
Author

Thanks for the fast review! Will defer the remaining comment until #6253 (comment) is addressed, since I think that it impacts how it should be resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Surface Configuration Failure is Fatal
2 participants