[CUDA] Support volumetric (3-D) grid sampling in the CUDA GridSample operator #27201
Conversation
Pull request overview
Adds CUDA Execution Provider support for volumetric (5-D / 3-D spatial) GridSample, aligning CUDA behavior with ONNX opset semantics (including opsets 20 and 22) and supporting both NCHW and NHWC layouts.
Changes:
- Implement the 3D (volumetric) CUDA kernel path for `GridSample` and wire it into the existing CUDA operator.
- Register CUDA `GridSample` kernels for ONNX opsets 20–21 and 22 (and NHWC variants where applicable).
- Update grid sample tests/execution provider selection to exercise CUDA for newer opsets.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/tensor/grid_sample_test.cc | Updates test execution providers and adds/adjusts opset coverage for GridSample cases. |
| onnxruntime/core/providers/cuda/tensor/grid_sample_impl.h | Declares the new 3D CUDA implementation entry point. |
| onnxruntime/core/providers/cuda/tensor/grid_sample_impl.cu | Implements the 3D CUDA grid sampling kernel + host launcher; minor 2D fixups/comments. |
| onnxruntime/core/providers/cuda/tensor/grid_sample.h | Tracks opset version in the CUDA kernel class. |
| onnxruntime/core/providers/cuda/tensor/grid_sample.cc | Adds opset-aware attribute parsing, 4D/5D validation, and dispatch to 2D vs 3D CUDA impl; registers opset 20/22 kernels. |
| onnxruntime/core/providers/cuda/cuda_nhwc_kernels.cc | Registers NHWC GridSample kernels for opset ranges 16–19, 20–21, and 22. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers ONNX-domain CUDA GridSample kernels for opset ranges 16–19, 20–21, and 22. |
| onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc | Adds clarifying comment about preserving batch dim in perm generation. |
Comments suppressed due to low confidence (1)
onnxruntime/test/providers/cpu/tensor/grid_sample_test.cc:734
- Test name indicates a 5-D case, but `X_shape` and `Grid_shape` are 4-D in this test. This makes the generated test suite confusing and can hide missing 5-D coverage. Either rename the test back to `..._20_4D_...` or update the shapes/data to be truly 5-D.
```cpp
TYPED_TEST(GridSampleTest, test_grid_sample_20_5D_bilinear_border_align_corners) {
  OpTester test("GridSample", 20);
  std::string mode = "linear";
  std::string padding_mode = "border";
  int64_t align_corners = 1;
  std::initializer_list<int64_t> X_shape{2, 2, 3, 2};
  std::initializer_list<TypeParam> X_data{TypeParam(-1.916003f), TypeParam(0.150784f), TypeParam(-0.179898f), TypeParam(0.402727f), TypeParam(-0.549764f), TypeParam(1.772484f), TypeParam(1.014343f), TypeParam(0.502823f), TypeParam(0.976771f), TypeParam(-0.071957f), TypeParam(0.519875f), TypeParam(0.408665f), TypeParam(1.435640f), TypeParam(-0.807775f), TypeParam(-0.181661f), TypeParam(-0.574026f), TypeParam(-0.335351f), TypeParam(-0.155602f), TypeParam(0.348749f), TypeParam(1.055618f), TypeParam(0.737784f), TypeParam(-0.394725f), TypeParam(0.597608f), TypeParam(0.006105f)};
  std::initializer_list<int64_t> Grid_shape{2, 3, 2, 2};
  std::initializer_list<TypeParam> Grid_data{TypeParam(-0.189838f), TypeParam(-1.050410f), TypeParam(-1.072351f), TypeParam(-0.930754f), TypeParam(-0.502573f), TypeParam(0.186642f), TypeParam(-0.564332f), TypeParam(-0.042774f), TypeParam(-0.143740f), TypeParam(1.097448f), TypeParam(-0.547044f), TypeParam(1.127440f), TypeParam(-0.921224f), TypeParam(-1.001202f), TypeParam(0.390232f), TypeParam(-0.698394f), TypeParam(0.615509f), TypeParam(-0.663897f), TypeParam(0.944958f), TypeParam(1.161950f), TypeParam(0.076823f), TypeParam(0.256464f), TypeParam(1.118784f), TypeParam(0.711380f)};
```
```cpp
std::string mode_str = info.GetAttrOrDefault<std::string>("mode", "bilinear");
std::string padding_mode_str = info.GetAttrOrDefault<std::string>("padding_mode", "zeros");
align_corners_ = static_cast<bool>(info.GetAttrOrDefault<int64_t>("align_corners", 0));
ORT_ENFORCE(mode_str == "bilinear" || mode_str == "nearest" || mode_str == "bicubic",
            "mode \"", mode_str, "\" not supported, expect bilinear, nearest or bicubic");
ORT_ENFORCE(padding_mode_str == "zeros" || padding_mode_str == "border" || padding_mode_str == "reflection",
            "padding_mode \"", padding_mode_str, "\" not supported, expect zeros, border or reflection");
if (mode_str == "bicubic") {
  mode_i_ = 2;
} else if (mode_str == "nearest") {
  mode_i_ = 1;
// ...
if (opset_start_version_ >= 20) {
  std::string mode_str = info.GetAttrOrDefault<std::string>("mode", "linear");
  if (mode_str == "cubic") {
```
Copilot AI (Jan 29, 2026):
`mode_str` is read with default "bilinear" and then immediately shadowed by a new `mode_str` inside both the opset >= 20 and opset < 20 branches, leaving the outer variable unused (which can trigger `-Wunused-variable`). Consider removing the outer `mode_str` and only reading the attribute within the version-specific branch (or reuse the existing variable instead of shadowing).
```cpp
z = (int64_t)GsReflect<T>(z, border[0], border[3]);
y = (int64_t)GsReflect<T>(y, border[1], border[4]);
x = (int64_t)GsReflect<T>(x, border[2], border[5]);
```
Copilot AI (Jan 29, 2026):
The reflection branch in `PixelAtGrid3D` can call `GsReflect` with `x_min == x_max` for axes where the input dimension is 1 and `align_corners` is true (e.g., `D_in == 1` with padding mode set to reflection). In that case `_GridSampleKernel3D` constructs `border` so that `x_max - x_min` is zero, causing `GsReflect` to divide by zero and convert INF/NaN back to an integer index, which can easily fall outside [0, D/H/W) and lead to out-of-bounds reads or writes from `input_data`. Add an explicit guard for the degenerate (dimension == 1, `align_corners == true`) case or clamp the reflected coordinates before calling `PixelOffset3D` so the final z, y, and x indices are always in range.
Suggested change:

```diff
-z = (int64_t)GsReflect<T>(z, border[0], border[3]);
-y = (int64_t)GsReflect<T>(y, border[1], border[4]);
-x = (int64_t)GsReflect<T>(x, border[2], border[5]);
+// Handle degenerate size-1 dimensions explicitly to avoid invalid reflection ranges,
+// and clamp reflected coordinates into the valid [0, D/H/W) ranges.
+if (D == 1) {
+  z = 0;
+} else {
+  z = (int64_t)GsReflect<T>(z, border[0], border[3]);
+  z = max((int64_t)0, min((int64_t)D - 1, z));
+}
+if (H == 1) {
+  y = 0;
+} else {
+  y = (int64_t)GsReflect<T>(y, border[1], border[4]);
+  y = max((int64_t)0, min((int64_t)H - 1, y));
+}
+if (W == 1) {
+  x = 0;
+} else {
+  x = (int64_t)GsReflect<T>(x, border[2], border[5]);
+  x = max((int64_t)0, min((int64_t)W - 1, x));
+}
```
Description
- Adds CUDA Execution Provider support for the volumetric `GridSample` operator, i.e. 5-D input tensors (3-D spatial data).
- Registers the CUDA `GridSample` operator for opsets 20 and 22.
- `cubic` mode is not supported for volumetric inputs for now. This is consistent with the CPU version of the implementation and hence will not cause a "functional regression": `cubic` mode for 3-D spatial data is not supported on CPU or CUDA either before or after this change. This is a TODO for the future.
- Adds tests to `grid_sample_test.cc` covering the volumetric input case; these run in both NCHW (NCDHW for the volumetric case) and NHWC (NDHWC for the volumetric case) layouts for the CUDA EP.

Motivation and Context
Resolve #21382
Resolve #18942
Resolve #16581
Resolve #18313
Related CPU PRs (for opset 20 and opset 22): #17744 and #23344