Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Surface#configure and Surface#get_current_texture non-fatal #6253

Open
wants to merge 4 commits into
base: trunk
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions examples/src/framework.rs
Original file line number Diff line number Diff line change
Expand Up @@ -226,6 +226,7 @@ impl SurfaceWrapper {
// If the surface is outdated, or was lost, reconfigure it.
wgpu::SurfaceError::Outdated
| wgpu::SurfaceError::Lost
| wgpu::SurfaceError::Other
// If OutOfMemory happens, reconfiguring may not help, but we might as well try
| wgpu::SurfaceError::OutOfMemory,
) => {
Expand Down
2 changes: 2 additions & 0 deletions wgpu-types/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5626,6 +5626,8 @@ pub enum SurfaceStatus {
Outdated,
/// The surface under the swap chain is lost.
Lost,
/// The surface status is not known.
Unknown,
Comment on lines +5629 to +5630
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if Unknown is really the right term here. We know that get_current_texture failed for a known reason when we get this. But then again we don't know what state our surface is in at this point (the error may have caused mayhem such that reconfigure is in order or the next get_current_texture call might just succeed).

So maybe it's enough to just clarify this here a little bit:

Suggested change
/// The surface status is not known.
Unknown,
/// The surface status is not known since `get_current_texture` previously failed.
Unknown,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of clarifying via the comment. The reason I chose Unknown as the name here is exactly what you mentioned: we don't really know the status of the surface when we get this error.

For example, if we get a DeviceLost error we can't reliably report anything about the status of the Surface (since we got an error higher in the stack)

}

/// Nanosecond timestamp used by the presentation engine.
Expand Down
1 change: 1 addition & 0 deletions wgpu/src/api/surface.rs
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,7 @@ impl Surface<'_> {
SurfaceStatus::Timeout => return Err(SurfaceError::Timeout),
SurfaceStatus::Outdated => return Err(SurfaceError::Outdated),
SurfaceStatus::Lost => return Err(SurfaceError::Lost),
SurfaceStatus::Unknown => return Err(SurfaceError::Other),
};

let guard = self.config.lock();
Expand Down
3 changes: 3 additions & 0 deletions wgpu/src/api/surface_texture.rs
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,8 @@ pub enum SurfaceError {
Lost,
/// There is no more memory left to allocate a new frame.
OutOfMemory,
/// Acquiring a texture failed for an unknown reason
Other,
Comment on lines +70 to +71
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar'ish suggestion SurfaceStatus::Unknown: It's not really an unknown reason, we just haven't passed it through (yet?) 🤔

Suggested change
/// Acquiring a texture failed for an unknown reason
Other,
/// Acquiring a texture failed with a generic error. Check error callbacks for more information.
Other,

string below needs adjusting as well.

}
static_assertions::assert_impl_all!(SurfaceError: Send, Sync);

Expand All @@ -77,6 +79,7 @@ impl fmt::Display for SurfaceError {
Self::Outdated => "The underlying surface has changed, and therefore the swap chain must be updated",
Self::Lost => "The swap chain has been lost and needs to be recreated",
Self::OutOfMemory => "There is no more memory left to allocate a new frame",
Self::Other => "Acquiring a texture failed for an unknown reason"
})
}
}
Expand Down
21 changes: 19 additions & 2 deletions wgpu/src/backend/wgpu_core.rs
Original file line number Diff line number Diff line change
Expand Up @@ -441,6 +441,9 @@ pub struct Surface {
/// Configured device is needed to know which backend
/// code to execute when acquiring a new frame.
configured_device: Mutex<Option<wgc::id::DeviceId>>,
/// The error sink with which to report errors.
/// `None` if the surface has not been configured.
error_sink: Mutex<Option<ErrorSink>>,
}

#[derive(Debug)]
Expand Down Expand Up @@ -572,6 +575,7 @@ impl crate::Context for ContextWgpuCore {
Ok(Surface {
id,
configured_device: Mutex::default(),
error_sink: Mutex::default(),
})
}

Expand Down Expand Up @@ -707,9 +711,10 @@ impl crate::Context for ContextWgpuCore {
.0
.surface_configure(surface_data.id, device_data.id, config);
if let Some(e) = error {
self.handle_error_fatal(e, "Surface::configure");
self.handle_error_nolabel(&device_data.error_sink, e, "Surface::configure");
} else {
*surface_data.configured_device.lock() = Some(device_data.id);
*surface_data.error_sink.lock() = Some(device_data.error_sink.clone());
}
}

Expand All @@ -736,7 +741,19 @@ impl crate::Context for ContextWgpuCore {
},
)
}
Err(err) => self.handle_error_fatal(err, "Surface::get_current_texture_view"),
Err(err) => match surface_data.error_sink.lock().as_ref() {
Some(error_sink) => {
self.handle_error_nolabel(error_sink, err, "Surface::get_current_texture_view");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we could forward this error as-is. We might add more status codes for this in the future or enhance the state that is Unknown right now to carry more information 🤔. E.g. "not configured" is an error but not a status right now which seems strange to me.

What irks me in particular about this is that according to your reports you are seeing spurious errors here with many of your users which implies that going to the error sink is the wrong thing to do: a reported error should imply application-implementor error rather than driver/end-user error. Meaning it's something that one would be expected to dynamically handle as an application-implementor. Note also that handle_error_* will cause a panic unless a custom error handler is installed.

... or are the errors you're seeing with your users exclusively about configure?

To clarify, we don't have to solve this here all the way. This is more of a sentiment / open question.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crash we're seeing right now is that we occasionally seem to get "Parent Device is Lost" validation errors when calling Surface#configure and Surface#get_current_texture. This only seems to be an issue with NVIDIA devices.

I've attached an example stacktrace of that below:
image

The most ergonomic solution to this (IMO, obviously you know this code better than I do) would be to forward the error and return a result in both functions so the caller can handle them. As you mentioned in the issue when we were discussing earlier, we need to conform with the webgpu spec around handling errors asynchronously so I'm not sure that's totally feasible.

What I'm find awkward about the current approach (and I think this is partly to your point) is that the get_current_texture already returns a Result. Which means in the case we return SurfaceError::Unknown the function returns an error but you really need to go use the error callbacks to get the full details on the error to determine how to handle it.

What do you think is the next step here? I think what is in this PR is ok (and acceptable) but would love to brainstorm how to make it more ergonomic. We can't return Ok (because we don't have a SurfaceTexture) but we could return an actual Err with the details of what happened (instead of just SurfaceError::Unknown). I'm not sure that's that much better though since we'd duplicate the error across the return type and the error callbacks...Is there prior art for error handling like this in wgpu that you know of? Would love suggestions on this :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for sharing those callstacks. This kind of tracking is incredibly valuable. We do some of that at Rerun as well but our userbase is a lot smaller at this point and a lot of the traffic is blocked (and it's easy to opt out), we don't have that good of an insight.

The most ergonomic solution to this (IMO, obviously you know this code better than I do) would be to forward the error and return a result in both functions so the caller can handle them. As you mentioned in the issue when we were discussing earlier, we need to conform with the webgpu spec around handling errors asynchronously so I'm not sure that's totally feasible.

100% agree the more we can forward directly and still be WebGPU compatible the better!
Seeing the picture a bit better by now, I think I actually have to row back a bit on my statements about that we can't return Results at all:

  • whenever webgpu spec says that something throws on the content timeline (the green marked stuff), we can catch those failures and convert them to a Result
    • the "queue up the error" case is for every failure on the device timeline and that is indeed most of the validation and why unfortunately a ton of things can't be Results and instead are asynchronously reported errors
  • whenever a failure (or better, a family of errors) occurs that can only happen with wgpu-core and never with the js webgpu api, then we can return a result

get_current_texture in particular has practically no failure case other than some resources being null which afaik isn't possible in our Rust interface.

What I'm find awkward about the current approach (and I think this is partly to your point) is that the get_current_texture already returns a Result

Yes agree again. With the above in mind about what failures are possible in wgpu-core vs webgpu this makes a bit more sense.

I think what is in this PR is ok (and acceptable)

Yep, definitely an improvement. Let's not get perfect in the way of good 😄

We can't return Ok (because we don't have a SurfaceTexture) but we could return an actual Err with the details of what happened (instead of just SurfaceError::Unknown). I'm not sure that's that much better though since we'd duplicate the error across the return type and the error callbacks...Is there prior art for error handling like this in wgpu that you know of? Would love suggestions on this :D

I think we should go over all surface relateed results & error codes again across the crate and unify them as much as possible. I haven't thought it through yet, but I think we could have status codes only show up as part of usable textures and everything else be an error with no status code at all.
Right now there's a lot of weird transformation going on between the wgpu interface, the wgpu-core context (still inside the wgpu crate) and the wgpu-core crate which has the actual implementation of the surfaces. Ideally we just pass things through as-is.

Not a lot of prior art to this unfortunately. There should probably more places in wgpu where we give back some limited Result - grepping for throw in the webgpu spec to see where the content timeline can fail yields a handful of results and they should probably all translate into Result. Very few do today, buffer mapping has some iirc.

(
None,
SurfaceStatus::Unknown,
SurfaceOutputDetail {
surface_id: surface_data.id,
},
)
}
None => self.handle_error_fatal(err, "Surface::get_current_texture_view"),
},
}
}

Expand Down