
Conversation

ZuseZ4 (Member) commented Jan 5, 2026

This intrinsic helps with supporting the various AMD & NVIDIA libraries like rocBLAS or cuBLAS.
They provide functions which must be called from the host, but require a mixture of host and device pointers.
The offload_args intrinsic maps our host allocations to device allocations and transfers memory as required.
It reuses the infrastructure we already have for the main offload intrinsic.
Unlike the main offload intrinsic, this one already fully works with std. I also got it to work with a single cargo invocation:
RUSTFLAGS="-L native=/opt/rocm-6.4.0/lib -l dylib=rocblas -l dylib=amdhip64 -l dylib=omp -l dylib=omptarget -Zoffload=Args -Zunstable-options" cargo +offload run -r
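
For orientation, here is a rough sketch of how a call site might look. The FFI symbol and the wrapper are made up for illustration; the call shape follows the offload_args(f, args) signature added in this PR, assuming it is exposed through core::intrinsics on nightly:

#![feature(core_intrinsics)] // nightly-only; combine with -Zoffload=Args as shown above

// Hypothetical stand-in for a rocBLAS/cuBLAS entry point: called from the
// host, but expecting device pointers for its array arguments.
unsafe extern "C" {
    fn rocblas_style_sgemv(a: *const f32, x: *const f32, y: *mut f32);
}

// offload_args calls this wrapper after mapping the host allocations to the
// device, so the arguments now refer to device memory. The body must only
// forward them to the library and never read them on the CPU (the soundness
// of this pattern is discussed further down in this thread).
fn sgemv_wrapper(a: &[f32; 6], x: &[f32; 3], y: &mut [f32; 2]) {
    unsafe { rocblas_style_sgemv(a.as_ptr(), x.as_ptr(), y.as_mut_ptr()) }
}

fn main() {
    let a = [1.0f32; 6];
    let x = [1.0f32; 3];
    let mut y = [0.0f32; 2];
    // Maps a, x, and y to the device, calls sgemv_wrapper with the remapped
    // arguments, and copies y back when the call returns.
    core::intrinsics::offload_args(sgemv_wrapper, (&a, &x, &mut y));
}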

I'll rebase and drop the first 3 commits once the cleanup PR lands.

I updated compiler/rustc_monomorphize/src/collector/autodiff.rs; it now works without no_mangle (previously the function wouldn't get codegen'ed without it). It also works without lto=fat as long as everything lives in main.rs.
If we put a function in lib.rs and call it from main.rs, it currently trips the verifier. Happy to fix that either here or in a follow-up PR:

thread 'rustc' (494962) panicked at compiler/rustc_monomorphize/src/collector.rs:468:13:
assertion failed: tcx.should_codegen_locally(instance)
stack backtrace:

cc @kevinsala @Sa4dUs

@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 5, 2026
@ZuseZ4 ZuseZ4 added the F-gpu_offload `#![feature(gpu_offload)]` label Jan 5, 2026
@ZuseZ4 ZuseZ4 force-pushed the offload-host-intrinsic branch from 020f669 to 555131e Compare January 5, 2026 15:21
@ZuseZ4 ZuseZ4 mentioned this pull request Jan 6, 2026
@ZuseZ4 ZuseZ4 force-pushed the offload-host-intrinsic branch 2 times, most recently from 020f669 to 0228337 Compare January 14, 2026 00:46
@ZuseZ4 ZuseZ4 marked this pull request as ready for review January 18, 2026 22:15
rustbot (Collaborator) commented Jan 18, 2026

Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter
gets adapted for the changes, if necessary.

cc @rust-lang/miri, @RalfJung, @oli-obk, @lcnr

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jan 18, 2026
rustbot (Collaborator) commented Jan 18, 2026

r? @mati865

rustbot has assigned @mati865.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

ZuseZ4 (Member, Author) commented Jan 18, 2026

r? @oli-obk

@rustbot rustbot assigned oli-obk and unassigned mati865 Jan 18, 2026
rustbot (Collaborator) commented Jan 18, 2026

oli-obk is not on the review rotation at the moment.
They may take a while to respond.


#[rustc_nounwind]
#[rustc_intrinsic]
pub const fn offload_args<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R;
ZuseZ4 (Member, Author)

TODO: docs

Member

Yeah that's a blocker for reviews.^^

@rustbot author

ZuseZ4 (Member, Author) commented Jan 19, 2026

^^ Here you go.

@RalfJung I'm somewhat concerned that people could write host code in the wrapper, which would break if we map arguments to the GPU. I will make a follow-up PR in which I add support for forwarding host arguments that should not be mapped; in the case of the rocBLAS test, those would be alpha and beta. Once we have that, I think I could prohibit calling a Rust function and enforce that f is a ForeignFn to prevent this risk, if you think it's needed?

Member

I don't know enough about this to have an opinion on that question.^^

But we can bikeshed the intrinsic as a whole a bit.

  • Why do we need both offload and offload_args?
  • Given that offload also has arguments, the name doesn't seem very descriptive. Maybe offload_with_host_helper or so? Just spitballing here, as I have no idea how this all actually gets used. ;)

ZuseZ4 (Member, Author) commented Jan 20, 2026

offload is there to map all arguments to the GPU and then launch a GPU kernel. It has a couple of extra args to tell offload (ol) how many threads and blocks we want when launching it. That GPU kernel comes from a Rust function which we compiled for the GPU.

offload_args also maps all the arguments to the GPU, but then launches a CPU function which accepts pointers to GPU memory. It relies on the called CPU function independently launching a GPU kernel. A lot of highly optimized libraries like cuBLAS, rocBLAS, cuDNN, etc. already do that. If you try to launch them with offload you will get a failure, because they are not meant to be launched from the GPU. If you launch them but pass CPU pointers, you will also get a failure, because all arrays are supposed to be in GPU memory. In the past NVIDIA used to partly support calling those functions from the GPU, but they dropped that support.

I could probably try to unify both offload intrinsics under one name, but with autodiff we also decided to split it into autodiff_forward and autodiff_reverse since they behaved too differently, which otherwise made it confusing.

Under the hood, both create an offload_begin_mapper, which copies data to the GPU, and an offload_end_mapper, which copies it back. The main intrinsic then also has a tgt_target_kernel_launch, which calls the GPU kernel. The offload_args intrinsic just calls the host function, but as you can see in the codegen test, we swap out the pointers beforehand. This gives us the benefit of being able to optimize both the same way, so both can use all the LLVM profiling/debugging/visualization tools (not yet tested, but that should work), and both benefit from the LLVM opts that we started writing for them.
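
To make the shared structure a bit more concrete, here is a rough sketch in plain Rust of what the two paths conceptually do. The helpers are stand-ins for the runtime calls named above, not their real signatures or the generated IR:

// Conceptual sketch only.
struct DeviceArgs;

fn offload_begin_mapper<T>(_host_args: &T) -> DeviceArgs {
    unimplemented!("copy host data to the device")
}
fn offload_end_mapper<T>(_host_args: &mut T, _dev: DeviceArgs) {
    unimplemented!("copy device data back to the host")
}
fn tgt_target_kernel_launch(_kernel: fn(&DeviceArgs), _dev: &DeviceArgs, _threads: u32, _blocks: u32) {
    unimplemented!("launch the GPU kernel that was compiled from Rust")
}

// Shape of the main offload intrinsic: map, launch the embedded GPU kernel, unmap.
fn offload_shape<T>(kernel: fn(&DeviceArgs), args: &mut T, threads: u32, blocks: u32) {
    let dev = offload_begin_mapper(args);
    tgt_target_kernel_launch(kernel, &dev, threads, blocks);
    offload_end_mapper(args, dev);
}

// Shape of offload_args: map, call the host library function with the swapped
// device pointers, unmap. The library launches its own GPU kernels internally.
fn offload_args_shape<T>(host_fn: fn(&DeviceArgs), args: &mut T) {
    let dev = offload_begin_mapper(args);
    host_fn(&dev);
    offload_end_mapper(args, dev);
}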

Happy to change the intrinsic name, since we expect to have a pretty macro on top of it. My motivation for offload_args was that this intrinsic only offloads the arguments, whereas the full offload intrinsic offloads both the arguments and a function to the GPU.

Does that help?

Member

It sounds a bit like offload_args is a lower-level primitive that could be used to implement offload... but it also seems like that's not actually how it works?

ZuseZ4 (Member, Author)

They share a lot of the implementation; that's why this PR is so small, especially once the first 3 refactor commits from a different PR are merged. If you look at the commits, you'll see that I've added a boolean host flag. If it is set to true, we directly call the host function given to the intrinsic. The compilation pipeline in this case is also shortened, and a few of our optional args are set to None, since we don't need to generate all that much info for offload.

If the host boolean is set to false, we generate a __tgt_target_kernel (I love their naming style) and launch the function given to the intrinsic through that. This requires that we have previously compiled the function to the GPU and embedded its bitcode into our IR.

Member

Okay, so... I take it that it does not make sense to implement offload on top of offload_args in "userland" (rather than in codegen).

Happy to change the intrinsic name, since we expect to have a pretty macro on top of it. My motivation for offload_args was that this intrinsic only offloads the arguments, whereas the full offload intrinsic offloads both the arguments and a function to the GPU.

I see. Happy to take input from some more folks on that.

Contributor

I’m not that familiar with offload, so I’ll need help on the details, but this is how I understood it:

There are two (and a half?) primitive operations involved:

  1. Converting a CPU pointer to a GPU pointer (and the other half: converting it back)
  2. Launching a kernel on a GPU with some given arguments

The current offload implements both of these. However, for using cuda/rocm libraries, you want to be able to only do 1 without also doing 2, right?

On a low level, 1 and 2 really are orthogonal operations, so it would make sense to split them into two separate things.
Though the details of actually implementing that are more complicated. offload is a rather high-level api that hides things going on underneath, including cpu vs gpu memory/pointers.

The interface of doing the copy to the device and back around a closure seems nice; it looks like scoped threads and similar patterns.
Does it do anything other than copy memory? If not, then we could also call it offload_copy_mem_args, offload_mem or offload_ptrs (or maybe something with ‘scoped’?).

I think we can then expose the two primitive operations as intrinsics:

/// Transfer memory referenced by args back and forth
pub fn offload_copy_mem_args<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R;

/// Call a kernel.
/// Does **not** copy memory (note: also no restriction on T)
pub fn offload_call_kernel<F, T, R>(f: F, args: T) -> R;

This composes nicely to implement offload in code:

pub fn offload<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R {
    offload_copy_mem_args(
        |offloaded_args| offload_call_kernel(f, offloaded_args),
        args,
    )
}

And it gets rid of the bool argument to trigger one or the other implementation.

Member

Replying here, since it seems you posted that in the wrong thread:

Also for the vision, I want these two intrinsics to be the default, which should (thanks to our new llvm opts) almost always be fast enough. However, we'll also offer explicit host2device, device2host memtransfers for users. oli-obk had a nice idea for a thin wrapper type that represents that a type is now on the GPU. In that case we could prevent usages on the CPU, and enforce that people only pass data to the GPU kernel which is already on the GPU.

But I see this third explicit mode (well it's not really a fully orthogonal mode, more like an extension) a bit similar to unsafe, it should hopefully be rarely necessary.

This sounds like you are discussing the public, user-visible API. In this PR we are discussing the underlying language primitives (intrinsics) powering that API. Those don't have the same concerns. No user should ever directly call these intrinsics. The intrinsics should be designed for

  • ease of specification
  • ease of implementation in backends

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 19, 2026
@ZuseZ4 ZuseZ4 force-pushed the offload-host-intrinsic branch from 0228337 to e0fd81c Compare January 19, 2026 23:00
rustbot (Collaborator) commented Jan 19, 2026

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@ZuseZ4 ZuseZ4 force-pushed the offload-host-intrinsic branch from e0fd81c to 1a22802 Compare January 20, 2026 01:30
rust-bors bot (Contributor) commented Jan 20, 2026

☔ The latest upstream changes (presumably #151395) made this pull request unmergeable. Please resolve the merge conflicts.

/// }
///
/// #[cfg(not(target_os = "linux"))]
/// fn sgemv_wrapper(A: &[f32; 6], x: &[f32; 3], y: &mut [f64; 2]) {
Contributor

A here is a reference to invalid memory, as the pointer behind A is only valid on the GPU but this code runs on the CPU (also, the cfg(not(target_os = "linux")) should probably be cfg(target_os = "linux"), without the not).

It works out fine here because A is never accessed, but I think it violates Rust’s guarantees.
To be safe, offload_args needs to map the type as well. I.e.

offload_args::<_, (&T, &[T])>(|t: *const T, ts: (*const T, usize)| {}, (t, &[t]));
// etc. for other mapped types

ZuseZ4 (Member, Author)

Yes, you are right; that's what I was mentioning to Ralf above: https://github.com/rust-lang/rust/pull/150683/changes#r2706310846 That means some safe usages can be currently unsound till the follow-up PR lands, but it's a half-implemented rustc intrinsic (which end users aren't supposed to use anyway), so I'd argue it's ok. The alternative is to make this PR bigger, but it clearly already takes a while to review, so I'd rather keep that separate.

Generally, I am very willing to put in significant effort to preserve noalias for the GPU kernels, which I think your type mapping does not do. If you look at the slides from my US LLVM dev talk, they show how I intend to keep it even for mut args like &mut [f32].

ZuseZ4 (Member, Author)

Also for the vision, I want these two intrinsics to be the default, which should (thanks to our new llvm opts) almost always be fast enough. However, we'll also offer explicit host2device, device2host memtransfers for users. oli-obk had a nice idea for a thin wrapper type that represents that a type is now on the GPU. In that case we could prevent usages on the CPU, and enforce that people only pass data to the GPU kernel which is already on the GPU.

But I see this third explicit mode (well it's not really a fully orthogonal mode, more like an extension) a bit similar to unsafe, it should hopefully be rarely necessary.
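
Just to illustrate that wrapper-type idea, an entirely hypothetical sketch (not part of this PR or any agreed design):

// Hypothetical marker wrapper for data that currently lives on the device.
// The point is only that safe code cannot touch the memory from the CPU;
// the real design (if we ever add one) may look completely different.
pub struct OnDevice<T: ?Sized> {
    ptr: *mut T, // device pointer; deliberately never exposed as a reference
}

impl<T: ?Sized> OnDevice<T> {
    /// Hand the device pointer to a library call that expects device memory.
    pub fn as_device_ptr(&self) -> *mut T {
        self.ptr
    }
}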

Member

That link doesn't work here, so I am not sure what you are referring to. But I definitely didn't realize that the example contains unsound code, or else I would have made a fuss about it. ;)

some safe usages can be currently unsound

The problem with the example isn't about being safe or not, it's about having UB. Examples with UB without a big fat warning sign are not okay.

Is there a better example that avoids the UB? Like, if you made all these references into raw pointers, would that work?

Also, this affects not just A but also x and y, doesn't it? They are all GPU pointers now?

ZuseZ4 (Member, Author)

Ah, GitHub being lovely again. I was referring to my comment here:

@RalfJung I'm somewhat concerned that people could write host code in the wrapper, which would break if we map arguments to the GPU.

Yes, A, x, and y are GPU pointers. That was why I had asked you about only supporting extern ForeignFns once the follow-up PR lands. The easiest solution for now is indeed to move this to raw pointers, until either of the follow-up PRs which make it safe has landed.
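
For illustration, the raw-pointer variant of the doc example's wrapper might look roughly like this (sgemv_ffi is a placeholder for the real rocBLAS call; the exact shape is still open):

// Hypothetical raw-pointer variant of the wrapper: it never materializes
// references to device memory and only forwards the pointers.
unsafe extern "C" {
    fn sgemv_ffi(a: *const f32, x: *const f32, y: *mut f64);
}

fn sgemv_wrapper(a: *const f32, x: *const f32, y: *mut f64) {
    // a, x, and y point to device memory; they must not be dereferenced here.
    unsafe { sgemv_ffi(a, x, y) }
}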

Member

The intrinsic is unsafe, so it's entirely fine if some uses of it cause UB. But the docs for the intrinsic should be clear about what exactly is or isn't UB, or if that is yet to be determined then they should be clear about that.

Also, the example shouldn't be UB. :)
