
Conversation

@Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Dec 10, 2025

Summary

This PR contains a design document proposing changes to Buffer.fill() to address feedback from PR #1318 that was prematurely merged.

Please review the design document and provide feedback before implementation proceeds.

Related

cc @leofang @kkraus14 @rparolin

@Andy-Jost Andy-Jost added enhancement Any code-related improvements P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Dec 10, 2025
@Andy-Jost Andy-Jost self-assigned this Dec 10, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Dec 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@Andy-Jost Andy-Jost added this to the cuda.core beta 10 milestone Dec 10, 2025
@Andy-Jost
Contributor Author

@kkraus14 @leofang @rparolin

I don't plan to commit the proposal document buffer-fill-redesign.md. Just posting this to close on the redesign before reimplementing the feature.

You can toggle between "source diff" and "rich diff" for commenting and rendering, respectively.

@Andy-Jost Andy-Jost force-pushed the buffer-fill-redesign branch from 9afc9d2 to 39351ec on December 10, 2025 at 20:40
Comment on lines 140 to 185
### Option C: Hybrid Approach

**`Buffer.fill()`** - Flexible:
```python
def fill(self, value, *, stream: Stream | GraphBuilder):
"""Fill buffer with value pattern.

Parameters
----------
value : int, numpy scalar, or bytes
- int: 1-byte fill (0-255)
- numpy.{u}int{8,16,32}: corresponding width fill
- bytes (1-4 bytes): raw byte pattern fill
"""
```

**`StridedMemoryView.fill()`** - Uses dtype:
```python
def fill(self, value, *, stream: Stream | GraphBuilder):
"""Fill view elements with value (uses view's dtype)."""
```

**Pros**:
- Buffer API is flexible and discoverable
- StridedMemoryView uses its natural dtype context
- Consistent with launch() scalar handling

**Cons**:
- More complex implementation in Buffer

## Recommendation

**Option C (Hybrid Approach)** provides the best balance:

1. **Buffer.fill()** accepts:
- `int` → 1-byte fill (value 0-255)
- `numpy.int8/uint8` → 1-byte fill
- `numpy.int16/uint16` → 2-byte fill
- `numpy.int32/uint32` → 4-byte fill
- `bytes` (length 1, 2, or 4) → raw pattern fill

2. **StridedMemoryView.fill()** (new):
- Uses the view's dtype to determine width and signedness
- Value is validated against the dtype's range

3. **Remove `width` parameter** from current API (breaking change, but before release)
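
As a sketch, the width-inference rules above might look like the following in pure Python. The helper name `infer_fill_pattern` is made up for illustration; it is not part of cuda.core or this proposal's API surface.

```python
import numpy as np

# Hypothetical sketch of Option C's width inference (helper name is invented).
_NUMPY_WIDTHS = {
    np.dtype(np.int8): 1, np.dtype(np.uint8): 1,
    np.dtype(np.int16): 2, np.dtype(np.uint16): 2,
    np.dtype(np.int32): 4, np.dtype(np.uint32): 4,
}

def infer_fill_pattern(value):
    """Return (pattern_bytes, width) for a fill value, per the rules above."""
    if isinstance(value, (bytes, bytearray)):
        if len(value) not in (1, 2, 4):
            raise ValueError("byte pattern must be 1, 2, or 4 bytes long")
        return bytes(value), len(value)
    if isinstance(value, np.generic):
        width = _NUMPY_WIDTHS.get(value.dtype)
        if width is None:
            raise TypeError(f"unsupported numpy dtype: {value.dtype}")
        return value.tobytes(), width
    if isinstance(value, int):
        # Plain int -> 1-byte fill; to_bytes raises OverflowError outside [0, 256)
        return value.to_bytes(1, "little"), 1
    raise TypeError(f"unsupported fill value type: {type(value).__name__}")
```

For example, `infer_fill_pattern(0xAB)` yields `(b"\xab", 1)`, while `infer_fill_pattern(np.uint16(0x1234))` yields a 2-byte pattern.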
Contributor Author


Here's the main part of the proposal.

Comment on lines 330 to 332
1. **Should plain Python `int` default to 1-byte or 4-byte?**
- Proposal: 1-byte (consistent with "Buffer is untyped bytes")
- Alternative: Error if int > 255, requiring explicit dtype for larger values
Collaborator


Python integers have a built-in `to_bytes` method that errors if the value doesn't fit in the requested width, so it makes sense to limit it to 1 byte in my opinion.
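
Concretely, the standard `int.to_bytes` behavior already enforces the range check:

```python
# int.to_bytes enforces the [0, 256) range when asked for one unsigned byte.
assert (255).to_bytes(1, "little") == b"\xff"
try:
    (256).to_bytes(1, "little")
except OverflowError:
    print("256 does not fit in one byte")
```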

@kkraus14
Collaborator

@Andy-Jost I'm aligned to the approach you've detailed out here. My only suggested change is related to the use of Python buffer protocol instead of specifically handling bytes and numpy scalar objects. The rest looks good to me!

@Andy-Jost
Contributor Author

> @Andy-Jost I'm aligned to the approach you've detailed out here. My only suggested change is related to the use of Python buffer protocol instead of specifically handling bytes and numpy scalar objects. The rest looks good to me!

Thanks for the quick feedback, Keith. I'll follow up with a code change based on these suggestions.

@leofang
Member

leofang commented Dec 10, 2025

@kkraus14 @Andy-Jost

Question:

  1. Wearing my array API hat, wouldn't Option C trap us in the unfortunate business of implementing type promotion rules? What if SMV.dtype is not compatible with the value type? Is it something that we want to handle at the cuda.core level?
  2. Can we verify using Python buffer protocol is faster than the ParamHolder dispatcher? My impression is that the buffer protocol overhead is nontrivial. Conversely, if it is faster we should create a task to track rewriting the dispatcher to use buffer protocol.

@Andy-Jost
Contributor Author

> @kkraus14 @Andy-Jost
>
> Question:
>
> 1. Wearing my array API hat, wouldn't Option C trap us in the unfortunate business of implementing type promotion rules? What if SMV.dtype is not compatible with the value type? Is it something that we want to handle at the cuda.core level?

Let me update the proposal to strip it down (e.g., Option C is off the table now).

If we go with Keith's suggestion to rely on the buffer protocol, we can sidestep type promotion for the basic Buffer.fill. However, there are gotchas with StridedMemoryView.fill and maybe this is what you have in mind. For instance:

```python
view = StridedMemoryView.from_buffer(buffer, layout, dtype=np.float32)
view.fill(1, stream=stream)
```

Is the fill value np.float32(1) or the byte 0x01?
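
For what it's worth, the two readings really do produce different memory contents. A plain NumPy/Python illustration (byte values assume a little-endian host):

```python
import numpy as np

# The two readings of view.fill(1) would write different patterns:
typed = np.float32(1).tobytes()   # 4 bytes: 00 00 80 3f (IEEE-754 1.0f)
raw = (1).to_bytes(1, "little")   # 1 byte: 01
print(typed.hex(), raw.hex())     # 0000803f 01
```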

Given the added complexity, I'm inclined to:

  1. Implement Buffer.fill with the buffer protocol approach.
  2. Defer StridedMemoryView.fill for now, as it needs more design
> 2. Can we verify using Python buffer protocol is faster than the ParamHolder dispatcher? My impression is that the buffer protocol overhead is nontrivial. Conversely, if it is faster we should create a task to track rewriting the dispatcher to use buffer protocol.

I don't have any sense of how the buffer protocol will compare with the ParamHolder dispatch. Buffer.fill is potentially simpler because it only supports 1/2/4 bytes. How about this: I can reimplement Buffer.fill based on the buffer protocol and benchmark the performance against ParamHolder. That at least allows us to fix the API right away. Then, based on the benchmark results we can decide what to do next with ParamHolder.

Design document addressing feedback from PR NVIDIA#1318 that was
prematurely merged. Proposes Option C (Hybrid Approach):
- Buffer.fill() infers width from value type (int, numpy scalar, bytes)
- StridedMemoryView.fill() uses view's dtype for width/signedness
- Removes explicit width parameter

Related to NVIDIA#1345
@Andy-Jost Andy-Jost force-pushed the buffer-fill-redesign branch from 39351ec to 279f90c on December 11, 2025 at 01:13
@Andy-Jost
Contributor Author

New (much shorter) proposal doc is live.

Collaborator

@kkraus14 kkraus14 left a comment


This approach looks good to me!

```
----------
value : int or buffer-protocol object
    - int: Must be in range [0, 256). Converted to 1 byte.
    - buffer-protocol object: Must be 1, 2, or 4 bytes.
```
Collaborator


super nitpick, but there's the collections.abc.Buffer type that exists specifically to indicate objects that support buffer protocol
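
For reference, `collections.abc.Buffer` was added in Python 3.12 (PEP 688); on older versions there is no such ABC, so the sketch below falls back to an approximate tuple of common buffer-exporting types:

```python
try:
    from collections.abc import Buffer  # Python 3.12+ (PEP 688)
except ImportError:
    # Older Pythons have no Buffer ABC; approximate with common exporters.
    Buffer = (bytes, bytearray, memoryview)

assert isinstance(b"\x00", Buffer)
assert isinstance(bytearray(2), Buffer)
assert isinstance(memoryview(b"\x00\x01"), Buffer)
assert not isinstance(1, Buffer)
```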

@kkraus14
Collaborator

> 2. Can we verify using Python buffer protocol is faster than the ParamHolder dispatcher? My impression is that the buffer protocol overhead is nontrivial. Conversely, if it is faster we should create a task to track rewriting the dispatcher to use buffer protocol.

I think we could special-case and fast-path certain object types and then fall back to the buffer protocol if desired? I.e., support `bytes` objects zero-copy directly via the CPython API first, then support `int` objects via a fast path using the CPython API, and then fall back to the buffer protocol?
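
A pure-Python sketch of that layered dispatch, purely illustrative: the helper name `extract_pattern` is invented, and a real implementation would use CPython C APIs (e.g., `PyBytes_AsString`, `PyLong_AsUnsignedLong`) for the fast paths rather than Python-level checks.

```python
def extract_pattern(value):
    """Layered dispatch: bytes fast path, int fast path, then the
    generic buffer protocol (hypothetical helper, not cuda.core API)."""
    if type(value) is bytes:
        pattern = value                          # fast path: use bytes directly
    elif type(value) is int:
        pattern = value.to_bytes(1, "little")    # fast path: 1-byte, range-checked
    else:
        pattern = bytes(memoryview(value).cast("B"))  # any buffer exporter
    if len(pattern) not in (1, 2, 4):
        raise ValueError("fill pattern must be 1, 2, or 4 bytes")
    return pattern
```

The exact-type checks (`type(value) is bytes`) mirror how a C fast path would branch before falling back to the general protocol.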

@leofang
Member

leofang commented Dec 11, 2025

There is one thing that came to my mind last night. @Andy-Jost you saw @chloechia4's design doc on pythonic cuFILE support. There is one bit of cuFILE that I like very much (at least the concept of it), which is that they support deferred C scalar values by accepting pointers in their read/write APIs.

Our current Buffer.fill design cannot be deferred: the value must be materialized at invocation time, due to the underlying memset C API constraint. We could work around this by conditionally dispatching to memcpy, but then we would need a way to express this special case, and I suspect we'd have to use a Python int to serve as the pointer address (as in launch). WDYT?

@Andy-Jost
Contributor Author

> There is one thing that came to my mind last night. @Andy-Jost you saw @chloechia4's design doc on pythonic cuFILE support. There is one bit of cuFILE that I like very much (at least the concept of it), which is that they support deferred C scalar values by accepting pointers in their read/write APIs.
>
> Our current Buffer.fill design cannot be deferred: the value must be materialized at invocation time, due to the underlying memset C API constraint. We could work around this by conditionally dispatching to memcpy, but then we would need a way to express this special case, and I suspect we'd have to use a Python int to serve as the pointer address (as in launch). WDYT?

Discussed: this is not something we can support due to driver API limitations.

@Andy-Jost Andy-Jost closed this Dec 11, 2025


Development

Successfully merging this pull request may close these issues.

Revisit Buffer.fill()
