
Conversation

@Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Dec 10, 2025

Summary

This PR contains a design document proposing changes to Buffer.fill() to address feedback from PR #1318 that was prematurely merged.

Please review the design document and provide feedback before implementation proceeds.

Related

cc @leofang @kkraus14 @rparolin

@Andy-Jost Andy-Jost added enhancement Any code-related improvements P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Dec 10, 2025
@Andy-Jost Andy-Jost self-assigned this Dec 10, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Dec 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@Andy-Jost Andy-Jost added this to the cuda.core beta 10 milestone Dec 10, 2025
@Andy-Jost
Contributor Author

@kkraus14 @leofang @rparolin

I don't plan to commit the proposal document buffer-fill-redesign.md. Just posting this to close on the redesign before reimplementing the feature.

You can toggle between "source diff" and "rich diff" for commenting and rendering, respectively.

@Andy-Jost Andy-Jost force-pushed the buffer-fill-redesign branch from 9afc9d2 to 39351ec on December 10, 2025 at 20:40
Comment on lines 140 to 185
### Option C: Hybrid Approach

**`Buffer.fill()`** - Flexible:
```python
def fill(self, value, *, stream: Stream | GraphBuilder):
"""Fill buffer with value pattern.

Parameters
----------
value : int, numpy scalar, or bytes
- int: 1-byte fill (0-255)
- numpy.{u}int{8,16,32}: corresponding width fill
- bytes (1-4 bytes): raw byte pattern fill
"""
```

**`StridedMemoryView.fill()`** - Uses dtype:
```python
def fill(self, value, *, stream: Stream | GraphBuilder):
"""Fill view elements with value (uses view's dtype)."""
```

**Pros**:
- Buffer API is flexible and discoverable
- StridedMemoryView uses its natural dtype context
- Consistent with launch() scalar handling

**Cons**:
- More complex implementation in Buffer

## Recommendation

**Option C (Hybrid Approach)** provides the best balance:

1. **Buffer.fill()** accepts:
- `int` → 1-byte fill (value 0-255)
- `numpy.int8/uint8` → 1-byte fill
- `numpy.int16/uint16` → 2-byte fill
- `numpy.int32/uint32` → 4-byte fill
- `bytes` (length 1, 2, or 4) → raw pattern fill

2. **StridedMemoryView.fill()** (new):
- Uses the view's dtype to determine width and signedness
- Value is validated against the dtype's range

3. **Remove `width` parameter** from current API (breaking change, but before release)
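
As a sketch, the width-inference rules above might look like the following in pure Python. The helper name `infer_fill_pattern` is made up for illustration; it is not part of cuda.core or this proposal's API surface.

```python
import numpy as np

# Hypothetical sketch of Option C's width inference (helper name is invented).
_NUMPY_WIDTHS = {
    np.dtype(np.int8): 1, np.dtype(np.uint8): 1,
    np.dtype(np.int16): 2, np.dtype(np.uint16): 2,
    np.dtype(np.int32): 4, np.dtype(np.uint32): 4,
}

def infer_fill_pattern(value):
    """Return (pattern_bytes, width) for a fill value, per the rules above."""
    if isinstance(value, (bytes, bytearray)):
        if len(value) not in (1, 2, 4):
            raise ValueError("byte pattern must be 1, 2, or 4 bytes long")
        return bytes(value), len(value)
    if isinstance(value, np.generic):
        width = _NUMPY_WIDTHS.get(value.dtype)
        if width is None:
            raise TypeError(f"unsupported numpy dtype: {value.dtype}")
        return value.tobytes(), width
    if isinstance(value, int):
        # Plain int -> 1-byte fill; to_bytes raises OverflowError outside [0, 256)
        return value.to_bytes(1, "little"), 1
    raise TypeError(f"unsupported fill value type: {type(value).__name__}")
```

For example, `infer_fill_pattern(0xAB)` yields `(b"\xab", 1)`, while `infer_fill_pattern(np.uint16(0x1234))` yields a 2-byte pattern.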
Contributor Author


Here's the main part of the proposal.

Comment on lines 330 to 332
1. **Should plain Python `int` default to 1-byte or 4-byte?**
- Proposal: 1-byte (consistent with "Buffer is untyped bytes")
- Alternative: Error if int > 255, requiring explicit dtype for larger values
Collaborator


Python integers have a built-in `to_bytes` method that errors if the value doesn't fit in the requested width, so it makes sense to limit it to 1 byte in my opinion.
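
Concretely, the standard `int.to_bytes` behavior already enforces the range check:

```python
# int.to_bytes enforces the [0, 256) range when asked for one unsigned byte.
assert (255).to_bytes(1, "little") == b"\xff"
try:
    (256).to_bytes(1, "little")
except OverflowError:
    print("256 does not fit in one byte")
```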

@kkraus14
Collaborator

@Andy-Jost I'm aligned to the approach you've detailed out here. My only suggested change is related to the use of Python buffer protocol instead of specifically handling bytes and numpy scalar objects. The rest looks good to me!

@Andy-Jost
Contributor Author

> @Andy-Jost I'm aligned to the approach you've detailed out here. My only suggested change is related to the use of Python buffer protocol instead of specifically handling bytes and numpy scalar objects. The rest looks good to me!

Thanks for the quick feedback, Keith. I'll follow up with a code change based on these suggestions.

@leofang
Member

leofang commented Dec 10, 2025

@kkraus14 @Andy-Jost

Question:

  1. Wearing my array API hat, wouldn't Option C trap us in the unfortunate business of implementing type promotion rules? What if SMV.dtype is not compatible with the value type? Is it something that we want to handle at the cuda.core level?
  2. Can we verify using Python buffer protocol is faster than the ParamHolder dispatcher? My impression is that the buffer protocol overhead is nontrivial. Conversely, if it is faster we should create a task to track rewriting the dispatcher to use buffer protocol.

@Andy-Jost
Contributor Author

> @kkraus14 @Andy-Jost
>
> Question:
>
> 1. Wearing my array API hat, wouldn't Option C trap us in the unfortunate business of implementing type promotion rules? What if SMV.dtype is not compatible with the value type? Is it something that we want to handle at the cuda.core level?

Let me update the proposal to strip it down (e.g., Option C is off the table now).

If we go with Keith's suggestion to rely on the buffer protocol, we can sidestep type promotion for the basic Buffer.fill. However, there are gotchas with StridedMemoryView.fill and maybe this is what you have in mind. For instance:

```python
view = StridedMemoryView.from_buffer(buffer, layout, dtype=np.float32)
view.fill(1, stream=stream)
```

Is the fill value np.float32(1) or the byte 0x01?
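
For what it's worth, the two readings really do produce different memory contents. A plain NumPy/Python illustration (byte values assume a little-endian host):

```python
import numpy as np

# The two readings of view.fill(1) would write different patterns:
typed = np.float32(1).tobytes()   # 4 bytes: 00 00 80 3f (IEEE-754 1.0f)
raw = (1).to_bytes(1, "little")   # 1 byte: 01
print(typed.hex(), raw.hex())     # 0000803f 01
```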

Given the added complexity, I'm inclined to:

  1. Implement Buffer.fill with the buffer protocol approach.
  2. Defer StridedMemoryView.fill for now, as it needs more design
> 2. Can we verify using Python buffer protocol is faster than the ParamHolder dispatcher? My impression is that the buffer protocol overhead is nontrivial. Conversely, if it is faster we should create a task to track rewriting the dispatcher to use buffer protocol.

I don't have any sense of how the buffer protocol will compare with the ParamHolder dispatch. Buffer.fill is potentially simpler because it only supports 1/2/4 bytes. How about this: I can reimplement Buffer.fill based on the buffer protocol and benchmark the performance against ParamHolder. That at least allows us to fix the API right away. Then, based on the benchmark results we can decide what to do next with ParamHolder.

Design document addressing feedback from PR NVIDIA#1318 that was
prematurely merged. Proposes Option C (Hybrid Approach):
- Buffer.fill() infers width from value type (int, numpy scalar, bytes)
- StridedMemoryView.fill() uses view's dtype for width/signedness
- Removes explicit width parameter

Related to NVIDIA#1345
@Andy-Jost Andy-Jost force-pushed the buffer-fill-redesign branch from 39351ec to 279f90c on December 11, 2025 at 01:13
@Andy-Jost
Contributor Author

New (much shorter) proposal doc is live.

Collaborator

@kkraus14 kkraus14 left a comment


This approach looks good to me!

```
----------
value : int or buffer-protocol object
    - int: Must be in range [0, 256). Converted to 1 byte.
    - buffer-protocol object: Must be 1, 2, or 4 bytes.
```
Collaborator


super nitpick, but there's the collections.abc.Buffer type that exists specifically to indicate objects that support buffer protocol
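
For reference, `collections.abc.Buffer` was added in Python 3.12 (PEP 688); on older versions there is no such ABC, so the sketch below falls back to an approximate tuple of common buffer-exporting types:

```python
try:
    from collections.abc import Buffer  # Python 3.12+ (PEP 688)
except ImportError:
    # Older Pythons have no Buffer ABC; approximate with common exporters.
    Buffer = (bytes, bytearray, memoryview)

assert isinstance(b"\x00", Buffer)
assert isinstance(bytearray(2), Buffer)
assert isinstance(memoryview(b"\x00\x01"), Buffer)
assert not isinstance(1, Buffer)
```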

@kkraus14
Collaborator

> 2. Can we verify using Python buffer protocol is faster than the ParamHolder dispatcher? My impression is that the buffer protocol overhead is nontrivial. Conversely, if it is faster we should create a task to track rewriting the dispatcher to use buffer protocol.

I think we could special-case and fast-path certain object types and then fall back to the buffer protocol if desired? I.e., support `bytes` objects zero-copy directly via the CPython API first, then support `int` objects via a fast path using the CPython API, and then fall back to the buffer protocol?
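
A pure-Python sketch of that layered dispatch, purely illustrative: the helper name `extract_pattern` is invented, and a real implementation would use CPython C APIs (e.g., `PyBytes_AsString`, `PyLong_AsUnsignedLong`) for the fast paths rather than Python-level checks.

```python
def extract_pattern(value):
    """Layered dispatch: bytes fast path, int fast path, then the
    generic buffer protocol (hypothetical helper, not cuda.core API)."""
    if type(value) is bytes:
        pattern = value                          # fast path: use bytes directly
    elif type(value) is int:
        pattern = value.to_bytes(1, "little")    # fast path: 1-byte, range-checked
    else:
        pattern = bytes(memoryview(value).cast("B"))  # any buffer exporter
    if len(pattern) not in (1, 2, 4):
        raise ValueError("fill pattern must be 1, 2, or 4 bytes")
    return pattern
```

The exact-type checks (`type(value) is bytes`) mirror how a C fast path would branch before falling back to the general protocol.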

@leofang
Member

leofang commented Dec 11, 2025

There is one thing that came to my mind last night. @Andy-Jost you saw @chloechia4's design doc on pythonic cuFILE support. There is one bit of cuFILE that I like very much (at least the concept of it), which is that they support deferred C scalar values by accepting pointers in their read/write APIs.

Our current Buffer.fill design cannot be deferred: the value must be materialized at invocation time, due to the underlying memset C API constraint. We could work around this by conditionally dispatching to memcpy, but then we would need a way to express this special case, and I suspect we'd have to use a Python int to serve as the pointer address (as in launch). WDYT?

@Andy-Jost
Contributor Author

> There is one thing that came to my mind last night. @Andy-Jost you saw @chloechia4's design doc on pythonic cuFILE support. There is one bit of cuFILE that I like very much (at least the concept of it), which is that they support deferred C scalar values by accepting pointers in their read/write APIs.
>
> Our current Buffer.fill design cannot be deferred: the value must be materialized at invocation time, due to the underlying memset C API constraint. We could work around this by conditionally dispatching to memcpy, but then we would need a way to express this special case, and I suspect we'd have to use a Python int to serve as the pointer address (as in launch). WDYT?

Discussed: this is not something we can support due to driver API limitations.

@Andy-Jost Andy-Jost closed this Dec 11, 2025


Development

Successfully merging this pull request may close these issues.

Revisit Buffer.fill()
