
Conversation

@Kathryn-cat
Contributor

@Kathryn-cat Kathryn-cat commented Oct 8, 2025

Summary of Changes

This PR introduces a unified DLPackExchangeAPI struct as described in proposal 175. The new convention replaces the previous mechanism of separate function pointers and aligns with the latest DLPack standard, as shown in PR 174.

The new DLPackExchangeAPI struct also includes a current_work_stream function pointer that allows more robust, integrated querying of the current device stream (e.g., the CUDA stream) during DLPack tensor exchanges. All conversions from/to DLPack have been updated to _no_sync variants, meaning callers should use current_work_stream to handle stream synchronization explicitly. The struct also provides a non-owning DLTensor conversion to avoid unnecessary reference counting.
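The layout described above can be sketched in ctypes as follows. This is a minimal illustration, not the upstream definition: the field names, ordering, and signatures here are assumptions inferred from this PR's description (the real signatures involve PyObject* and DLManagedTensorVersioned*, which are left opaque).

```python
import ctypes

# Opaque stand-ins for the real function-pointer signatures (assumed).
FromPyObjectFn = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p, ctypes.c_void_p)
ToPyObjectFn = ctypes.CFUNCTYPE(ctypes.c_void_p, ctypes.c_void_p)
CurrentWorkStreamFn = ctypes.CFUNCTYPE(ctypes.c_void_p, ctypes.c_int, ctypes.c_int)

class DLPackExchangeAPI(ctypes.Structure):
    """Hypothetical mirror of the C struct: one table of function pointers
    instead of passing each pointer separately."""
    _fields_ = [
        ("from_pyobject", FromPyObjectFn),             # PyObject -> DLPack, no sync
        ("to_pyobject", ToPyObjectFn),                 # DLPack -> PyObject, no sync
        ("current_work_stream", CurrentWorkStreamFn),  # query current device stream
    ]
```

Consolidating the pointers into one struct means a consumer needs only a single address to reach every exchange function, rather than one attribute per function.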

Following this change, the Python FFI for PyTorch has been updated to expose the new DLPackExchangeAPI struct via __c_dlpack_exchange_api__ on torch.Tensor.
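As an illustration of how a consumer might use this attribute, the sketch below assumes `__c_dlpack_exchange_api__` exposes the struct's address as an integer; a FakeTensor stands in for torch.Tensor so the example is self-contained. Both assumptions are hypothetical, not the real torch binding.

```python
import ctypes

class FakeTensor:
    """Stand-in for torch.Tensor carrying a hypothetical API address."""
    def __init__(self, api_addr: int):
        self.__c_dlpack_exchange_api__ = api_addr

# Pretend this buffer is the static C struct owned by the framework.
buf = ctypes.create_string_buffer(24)
tensor = FakeTensor(ctypes.addressof(buf))

# A consumer reads the address once and treats it as an opaque pointer.
api_ptr = ctypes.c_void_p(tensor.__c_dlpack_exchange_api__)
assert api_ptr.value == ctypes.addressof(buf)
```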

The 3rdparty/dlpack submodule has been updated to the latest commit.

Benchmark Results

The benchmark results show that the overhead of the API (passing in arguments, performing the conversion at the C++ extension level, and returning) matches the expected speedup described in proposal 175.

[Benchmark results screenshot, 2025-10-11]

Followup in later PR

This PR does not include unit tests for the C++ conversion; a follow-up PR will add them.

@Kathryn-cat Kathryn-cat force-pushed the kathy/dlpack-c branch 2 times, most recently from f6895ff to 4060e27 Compare October 10, 2025 04:26
@Kathryn-cat Kathryn-cat marked this pull request as ready for review October 11, 2025 00:31
@Kathryn-cat Kathryn-cat changed the title wip: C Functions for DLPack Speed Exchange and Stream Handling C Functions for DLPack Speed Exchange and Stream Handling Oct 11, 2025
@Kathryn-cat Kathryn-cat changed the title C Functions for DLPack Speed Exchange and Stream Handling [DLPack] C Functions for DLPack Speed Exchange and Stream Handling Oct 11, 2025
@Kathryn-cat Kathryn-cat requested a review from tqchen October 11, 2025 02:47
@Kathryn-cat
Contributor Author

@tqchen comments are addressed. I think this PR is in good shape now; let's separate the unit tests into another PR.

@Kathryn-cat
Contributor Author

/gemini Review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the DLPack integration to use the new DLPackExchangeAPI struct, replacing the previous mechanism of separate function pointers. This is a significant improvement that aligns with the latest DLPack standard, enhances maintainability, and centralizes the DLPack C-API functions. The changes are extensive, touching C++ JIT code, Cython FFI bindings, and tests. The implementation is well-executed and consistent across the codebase. I've found one minor issue regarding device type handling that could affect non-CUDA GPU backends.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Member

@tqchen tqchen left a comment


Final comment; we are close. Please fix the lint.

@Kathryn-cat
Contributor Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request significantly refactors the DLPack integration by introducing a unified DLPackExchangeAPI struct. This change streamlines the process of exchanging tensors and handling device streams between different frameworks, particularly PyTorch, by consolidating related C functions into a single, more manageable interface. It also adds support for non-owning DLTensor conversions to improve efficiency.

Highlights

  • Unified DLPack Exchange API: A new DLPackExchangeAPI C struct has been introduced to consolidate all DLPack-related function pointers (for converting to/from PyObject, allocating tensors, and querying current work streams) into a single, streamlined interface. This replaces the previous method of passing individual function pointers.
  • Enhanced Stream Handling: The new DLPackExchangeAPI now includes a current_work_stream function pointer, which allows for more robust and integrated querying of the current device stream (e.g., CUDA stream) during DLPack tensor exchanges, particularly for non-CPU devices.
  • Non-Owning DLTensor Conversion: A new toDLPackNonOwningImpl function has been added to facilitate the creation of non-owning DLTensor instances directly from PyTorch tensors. This can lead to performance improvements by avoiding unnecessary reference counting overhead.
  • Refactored PyTorch Integration: The Python FFI for PyTorch has been updated to expose the new DLPackExchangeAPI struct via __c_dlpack_exchange_api__ on torch.Tensor, simplifying the integration and interaction with DLPack functionalities.
  • DLPack Submodule Update: The 3rdparty/dlpack submodule has been updated to a newer commit, which likely incorporates the necessary definitions and changes for the new DLPackExchangeAPI structure.
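The stream-handling contract highlighted above — conversions are _no_sync, so the caller queries current_work_stream and orders work explicitly — can be sketched as below. The stream callback is a stub standing in for a real CUDA stream query, and the handle value and device codes beyond kDLCUDA are illustrative assumptions.

```python
import ctypes

kDLCUDA = 2  # DLPack device type code for CUDA

CurrentWorkStreamFn = ctypes.CFUNCTYPE(ctypes.c_void_p, ctypes.c_int, ctypes.c_int)

@CurrentWorkStreamFn
def fake_current_work_stream(device_type, device_id):
    # Stub: pretend stream handle 0xBEEF is current on CUDA devices.
    return 0xBEEF if device_type == kDLCUDA else None

def exchange_with_sync(device_type, device_id):
    """Query the current stream, do a no-sync conversion, then let the
    caller synchronize explicitly for non-CPU devices."""
    stream = fake_current_work_stream(device_type, device_id)
    # ... perform the no-sync DLPack conversion here ...
    if stream:
        # In real code the caller would order work against this stream,
        # e.g. via cudaStreamSynchronize or a stream-wait event.
        pass
    return stream
```

On CPU (where there is no work stream) the callback returns NULL, so the synchronization branch is skipped entirely.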
Changelog
  • 3rdparty/dlpack
    • Updated the DLPack submodule to a newer commit.
  • python/tvm_ffi/_optional_torch_c_dlpack.py
    • Introduced toDLPackNonOwningImpl for non-owning DLTensor conversion.
    • Refactored individual DLPack C functions into a unified TorchDLPackExchangeAPI struct.
    • Added TorchDLPackCurrentWorkStream for querying device streams.
    • Updated torch.Tensor to expose the new __c_dlpack_exchange_api__ instead of individual function pointers.
  • python/tvm_ffi/core.pyi
    • Updated type hints to reflect the change to __c_dlpack_exchange_api__.
  • python/tvm_ffi/cython/base.pxi
    • Added DLPACK_MAJOR_VERSION and DLPACK_MINOR_VERSION macros.
    • Defined the DLPackExchangeAPI struct and its associated function pointer types.
    • Updated TVMFFIPyCallContext and TVMFFIPyArgSetter to use the new const DLPackExchangeAPI* c_dlpack_exchange_api member.
  • python/tvm_ffi/cython/function.pxi
    • Modified make_ret to accept the DLPackExchangeAPI pointer.
    • Refactored TVMFFIPyArgSetterDLPackCExporter_ to TVMFFIPyArgSetterDLPackExchangeAPI_ to utilize the new API struct.
    • Introduced GetTorchFallbackExchangeAPI for handling fallback scenarios.
  • python/tvm_ffi/cython/tensor.pxi
    • Refactored DLTensorTestWrapper's DLPack functions into separate _dltensor_test_wrapper_from_pyobject, _dltensor_test_wrapper_to_pyobject, and _dltensor_test_wrapper_current_work_stream functions.
    • Implemented _dltensor_test_wrapper_get_exchange_api to provide a static DLPackExchangeAPI instance for testing purposes.
    • Updated DLTensorTestWrapper to use __c_dlpack_exchange_api__.
  • python/tvm_ffi/cython/tvm_ffi_python_helpers.h
    • Updated TVMFFIPyCallContext and TVMFFIPyArgSetter to use a single const DLPackExchangeAPI* member instead of individual function pointers.
    • Modified TVMFFIPyFuncCall to correctly pass and receive the DLPackExchangeAPI pointer.
  • tests/python/test_load_inline.py
    • Updated test_load_inline_with_env_tensor_allocator to check for the presence of __c_dlpack_exchange_api__.
Activity
  • Kathryn-cat addressed comments from tqchen and indicated that the PR is in good shape, suggesting that unit tests be moved to a separate PR.
  • tqchen provided several review comments, including suggestions for refactoring context fields, clarifying static member initialization, moving implementations to the aten:: namespace, ensuring consistent naming, and updating the DLPack API passing mechanism.
  • Kathryn-cat updated the code multiple times in response to tqchen's feedback.
  • A bot (gemini-code-assist[bot]) identified a broad condition for device type checks in CUDA stream handling and suggested making it specific to kDLCUDA or kDLROCM, which was subsequently incorporated into the code.

@tqchen tqchen merged commit 22a7894 into apache:main Oct 11, 2025
7 checks passed
tqchen pushed a commit that referenced this pull request Oct 13, 2025
…111)

As a followup to PR #96, this PR adds comprehensive unit tests for
`torch.Tensor.__c_dlpack_exchange_api__` using inline C++. It validates
PyTorch's implementation of the `DLPackExchangeAPI` struct-based fast
exchange protocol.

Unlike the ctypes-based tests, these tests use
`torch.utils.cpp_extension.load_inline` to avoid GIL release issues
when calling `THPVariable_Wrap`.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
SigureMo added a commit to cattidea/Paddle that referenced this pull request Oct 14, 2025
