
Conversation

ZuseZ4 (Member) commented Jan 5, 2026

This intrinsic helps with supporting the various AMD & NVIDIA libraries like rocBLAS or cuBLAS.
They provide functions which must be called from the host, but require a mixture of host and device pointers.
The offload_args intrinsic maps our host allocations to device allocations and transfers memory as required.
It reuses the infrastructure we already have for the main offload intrinsic.
Unlike the main offload intrinsic, this one already fully works with std. I also got it to work with a single cargo invocation:
RUSTFLAGS="-L native=/opt/rocm-6.4.0/lib -l dylib=rocblas -l dylib=amdhip64 -l dylib=omp -l dylib=omptarget -Zoffload=Args -Zunstable-options" cargo +offload run -r
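
For orientation, here is a rough sketch of how a call site might look. The FFI symbol and the wrapper are made up for illustration; the call shape follows the offload_args(f, args) signature added in this PR, assuming it is exposed through core::intrinsics on nightly:

#![feature(core_intrinsics)] // nightly-only; combine with -Zoffload=Args as shown above

// Hypothetical stand-in for a rocBLAS/cuBLAS entry point: called from the
// host, but expecting device pointers for its array arguments.
unsafe extern "C" {
    fn rocblas_style_sgemv(a: *const f32, x: *const f32, y: *mut f32);
}

// offload_args calls this wrapper after mapping the host allocations to the
// device, so the arguments now refer to device memory. The body must only
// forward them to the library and never read them on the CPU (the soundness
// of this pattern is discussed further down in this thread).
fn sgemv_wrapper(a: &[f32; 6], x: &[f32; 3], y: &mut [f32; 2]) {
    unsafe { rocblas_style_sgemv(a.as_ptr(), x.as_ptr(), y.as_mut_ptr()) }
}

fn main() {
    let a = [1.0f32; 6];
    let x = [1.0f32; 3];
    let mut y = [0.0f32; 2];
    // Maps a, x, and y to the device, calls sgemv_wrapper with the remapped
    // arguments, and copies y back when the call returns.
    core::intrinsics::offload_args(sgemv_wrapper, (&a, &x, &mut y));
}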

I'll rebase and drop the first 3 commits once the cleanup PR lands.

I updated compiler/rustc_monomorphize/src/collector/autodiff.rs; it now works without no_mangle (previously the function wouldn't get codegen'ed without it). It also works without lto=fat as long as everything lives in main.rs.
If we put a function in lib.rs and call it from main.rs, it currently trips the verifier. Happy to fix that either here or in a follow-up PR:

thread 'rustc' (494962) panicked at compiler/rustc_monomorphize/src/collector.rs:468:13:
assertion failed: tcx.should_codegen_locally(instance)
stack backtrace:

cc @kevinsala @Sa4dUs

@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 5, 2026
@ZuseZ4 ZuseZ4 added the F-gpu_offload `#![feature(gpu_offload)]` label Jan 5, 2026
@ZuseZ4 ZuseZ4 force-pushed the offload-host-intrinsic branch from 020f669 to 555131e Compare January 5, 2026 15:21
@ZuseZ4 ZuseZ4 mentioned this pull request Jan 6, 2026
@ZuseZ4 ZuseZ4 force-pushed the offload-host-intrinsic branch 2 times, most recently from 020f669 to 0228337 Compare January 14, 2026 00:46
@ZuseZ4 ZuseZ4 marked this pull request as ready for review January 18, 2026 22:15
rustbot (Collaborator) commented Jan 18, 2026

Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter
gets adapted for the changes, if necessary.

cc @rust-lang/miri, @RalfJung, @oli-obk, @lcnr

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jan 18, 2026
rustbot (Collaborator) commented Jan 18, 2026

r? @mati865

rustbot has assigned @mati865.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

ZuseZ4 (Member, Author) commented Jan 18, 2026

r? @oli-obk

@rustbot rustbot assigned oli-obk and unassigned mati865 Jan 18, 2026
rustbot (Collaborator) commented Jan 18, 2026

oli-obk is not on the review rotation at the moment.
They may take a while to respond.


#[rustc_nounwind]
#[rustc_intrinsic]
pub const fn offload_args<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R;
ZuseZ4 (Member, Author)

TODO: docs

Member

Yeah that's a blocker for reviews.^^

@rustbot author

ZuseZ4 (Member, Author) commented Jan 19, 2026

^^ Here you go.

@RalfJung I'm somewhat concerned that people could write host code in the wrapper, which would break if we map arguments to the GPU. I will make a follow-up PR in which I add support for forwarding host arguments that should not be mapped; in the case of the rocBLAS test, those would be alpha and beta. Once we have that, I think I could prohibit calling a Rust function and enforce that f is a ForeignFn to prevent this risk, if you think it's needed?

Member

I don't know enough about this to have an opinion on that question.^^

But we can bikeshed the intrinsic as a whole a bit.

  • Why do we need both offload and offload_args?
  • Given that offload also has arguments, the name doesn't seem very descriptive. Maybe offload_with_host_helper or so? Just spitballing here, as I have no idea how this all actually gets used. ;)

ZuseZ4 (Member, Author) commented Jan 20, 2026

offload is there to map all arguments to the GPU and then launch a GPU kernel. It has a couple of extra args to tell offload (ol) how many threads and blocks we want when launching it. That GPU kernel comes from a Rust function which we compiled for the GPU.

offload_args also maps all the arguments to the GPU, but then launches a CPU function which accepts pointers to GPU memory. It relies on the called CPU function independently launching a GPU kernel. A lot of highly optimized libraries like cuBLAS, rocBLAS, cuDNN, etc. already do that. If you try to launch them with offload you will get a failure, because they are not meant to be launched from the GPU. If you launch them but pass CPU pointers, you will also get a failure, because all arrays are supposed to be in GPU memory. In the past NVIDIA used to partly support calling those functions from the GPU, but they dropped that support.

I could probably try to unify both offload intrinsics under one name, but with autodiff we also decided to split it into autodiff_forward and autodiff_reverse since they behaved too differently, which otherwise made it confusing.

Under the hood, both create an offload_begin_mapper, which copies data to the GPU, and an offload_end_mapper, which copies it back. The main intrinsic then also has a tgt_target_kernel_launch, which calls the GPU kernel. The offload_args intrinsic just calls the host function, but as you can see in the codegen test, we swap out the pointers beforehand. This gives us the benefit of being able to optimize both the same way, so both can use all the LLVM profiling/debugging/visualization tools (not yet tested, but that should work), and both benefit from the LLVM opts that we started writing for them.
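
To make the shared structure a bit more concrete, here is a rough sketch in plain Rust of what the two paths conceptually do. The helpers are stand-ins for the runtime calls named above, not their real signatures or the generated IR:

// Conceptual sketch only.
struct DeviceArgs;

fn offload_begin_mapper<T>(_host_args: &T) -> DeviceArgs {
    unimplemented!("copy host data to the device")
}
fn offload_end_mapper<T>(_host_args: &mut T, _dev: DeviceArgs) {
    unimplemented!("copy device data back to the host")
}
fn tgt_target_kernel_launch(_kernel: fn(&DeviceArgs), _dev: &DeviceArgs, _threads: u32, _blocks: u32) {
    unimplemented!("launch the GPU kernel that was compiled from Rust")
}

// Shape of the main offload intrinsic: map, launch the embedded GPU kernel, unmap.
fn offload_shape<T>(kernel: fn(&DeviceArgs), args: &mut T, threads: u32, blocks: u32) {
    let dev = offload_begin_mapper(args);
    tgt_target_kernel_launch(kernel, &dev, threads, blocks);
    offload_end_mapper(args, dev);
}

// Shape of offload_args: map, call the host library function with the swapped
// device pointers, unmap. The library launches its own GPU kernels internally.
fn offload_args_shape<T>(host_fn: fn(&DeviceArgs), args: &mut T) {
    let dev = offload_begin_mapper(args);
    host_fn(&dev);
    offload_end_mapper(args, dev);
}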

Happy to change the intrinsic name, since we expect to have a pretty macro on top of it. My motivation for offload_args was that this intrinsic only offloads the arguments, whereas the full offload intrinsic offloads both the arguments and a function to the GPU.

Does that help?

Member

It sounds a bit like offload_args is a lower-level primitive that could be used to implement offload... but it also seems like that's not actually how it works?

ZuseZ4 (Member, Author)

They share a lot of the implementation; that's why this PR is so small, especially once the first 3 refactor commits from a different PR are merged. If you look at the commits, you'll see that I've added a boolean host flag. If it is set to true, we directly call the host function given to the intrinsic. The compilation pipeline in this case is also shortened, and a few of our optional args are set to None, since we don't need to generate all that much info for offload.

If the host boolean is set to false, we generate a __tgt_target_kernel (I love their naming style) and launch the function given to the intrinsic through that. This requires that we have previously compiled the function to the GPU and embedded its bitcode into our IR.

Member

Okay, so... I take it that it does not make sense to implement offload on top of offload_args in "userland" (rather than in codegen).

Happy to change the intrinsic name, since we expect to have a pretty macro on top of it. My motivation for offload_args was that this intrinsic only offloads the arguments, whereas the full offload intrinsic offloads both the arguments and a function to the GPU.

I see. Happy to take input from some more folks on that.

Contributor

I’m not that familiar with offload, so I’ll need help on the details, but this is how I understood it:

There are two (and a half?) primitive operations involved:

  1. Converting a CPU pointer to a GPU pointer (and the other half: converting it back)
  2. Launching a kernel on a GPU with some given arguments

The current offload implements both of these. However, for using cuda/rocm libraries, you want to be able to only do 1 without also doing 2, right?

On a low level, 1 and 2 really are orthogonal operations, so it would make sense to split them into two separate things.
Though the details of actually implementing that are more complicated. offload is a rather high-level api that hides things going on underneath, including cpu vs gpu memory/pointers.

The interface of doing the copy to the device and back around a closure seems nice; it looks like scoped threads and similar patterns.
Does it do anything other than copy memory? If not, then we could also call it offload_copy_mem_args, offload_mem or offload_ptrs (or maybe something with ‘scoped’?).

I think we can then expose the two primitive operations as intrinsics:

/// Transfer memory referenced by args back and forth
pub fn offload_copy_mem_args<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R;

/// Call a kernel.
/// Does **not** copy memory (note: also no restriction on T)
pub fn offload_call_kernel<F, T, R>(f: F, args: T) -> R;

This composes nicely to implement offload in code:

pub fn offload<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R {
    offload_copy_mem_args(
        |offloaded_args| offload_call_kernel(f, offloaded_args),
        args,
    )
}

And it gets rid of the bool argument to trigger one or the other implementation.

Member

Replying here, since it seems you posted that in the wrong thread:

Also for the vision, I want these two intrinsics to be the default, which should (thanks to our new llvm opts) almost always be fast enough. However, we'll also offer explicit host2device, device2host memtransfers for users. oli-obk had a nice idea for a thin wrapper type that represents that a type is now on the GPU. In that case we could prevent usages on the CPU, and enforce that people only pass data to the GPU kernel which is already on the GPU.

But I see this third explicit mode (well it's not really a fully orthogonal mode, more like an extension) a bit similar to unsafe, it should hopefully be rarely necessary.

This sounds like you are discussing the public, user-visible API. In this PR we are discussing the underlying language primitives (intrinsics) powering that API. Those don't have the same concerns. No user should ever directly call these intrinsics. The intrinsics should be designed for

  • ease of specification
  • ease of implementation in backends

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 19, 2026
@ZuseZ4 ZuseZ4 force-pushed the offload-host-intrinsic branch from 0228337 to e0fd81c Compare January 19, 2026 23:00
rustbot (Collaborator) commented Jan 19, 2026

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@ZuseZ4 ZuseZ4 force-pushed the offload-host-intrinsic branch from e0fd81c to 1a22802 Compare January 20, 2026 01:30
rust-bors bot (Contributor) commented Jan 20, 2026

☔ The latest upstream changes (presumably #151395) made this pull request unmergeable. Please resolve the merge conflicts.

/// }
///
/// #[cfg(not(target_os = "linux"))]
/// fn sgemv_wrapper(A: &[f32; 6], x: &[f32; 3], y: &mut [f64; 2]) {
Contributor

A here is a reference to invalid memory, as the pointer behind A is only valid on the GPU but this code runs on the CPU (also, the cfg(not(target_os = "linux")) should probably be cfg(target_os = "linux"), without the not).

It works out fine here because A is never accessed, but I think it violates Rust’s guarantees.
To be safe, offload_args needs to map the type as well. I.e.

offload_args::<_, (&T, &[T])>(|t: *const T, ts: (*const T, usize)| {}, (t, &[t]));
// etc. for other mapped types

ZuseZ4 (Member, Author)

Yes, you are right; that's what I was mentioning to Ralf above: https://github.com/rust-lang/rust/pull/150683/changes#r2706310846 That means some safe usages can be currently unsound till the follow-up PR lands, but it's a half-implemented rustc intrinsic (which end users aren't supposed to use anyway), so I'd argue it's ok. The alternative is to make this PR bigger, but it clearly already takes a while to review, so I'd rather keep that separate.

Generally, I am very willing to put in significant effort to preserve noalias for the GPU kernels, which I think your type mapping does not do. If you look at the slides from my US LLVM dev talk, they show how I intend to keep it even for mut args like &mut [f32].

ZuseZ4 (Member, Author)

Also for the vision, I want these two intrinsics to be the default, which should (thanks to our new llvm opts) almost always be fast enough. However, we'll also offer explicit host2device, device2host memtransfers for users. oli-obk had a nice idea for a thin wrapper type that represents that a type is now on the GPU. In that case we could prevent usages on the CPU, and enforce that people only pass data to the GPU kernel which is already on the GPU.

But I see this third explicit mode (well it's not really a fully orthogonal mode, more like an extension) a bit similar to unsafe, it should hopefully be rarely necessary.
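
Just to illustrate that wrapper-type idea, an entirely hypothetical sketch (not part of this PR or any agreed design):

// Hypothetical marker wrapper for data that currently lives on the device.
// The point is only that safe code cannot touch the memory from the CPU;
// the real design (if we ever add one) may look completely different.
pub struct OnDevice<T: ?Sized> {
    ptr: *mut T, // device pointer; deliberately never exposed as a reference
}

impl<T: ?Sized> OnDevice<T> {
    /// Hand the device pointer to a library call that expects device memory.
    pub fn as_device_ptr(&self) -> *mut T {
        self.ptr
    }
}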

Member

That link doesn't work here, so I am not sure what you are referring to. But I definitely didn't realize that the example contains unsound code, or else I would have made a fuss about it. ;)

some safe usages can be currently unsound

The problem with the example isn't about being safe or not, it's about having UB. Examples with UB without a big fat warning sign are not okay.

Is there a better example that avoids the UB? Like, if you made all these references into raw pointers, would that work?

Also, this affects not just A but also x and y, doesn't it? They are all GPU pointers now?

ZuseZ4 (Member, Author)

Ah, GitHub being lovely again. I was referring to my comment here:

@RalfJung I'm somewhat concerned that people could write host code in the wrapper, which would break if we map arguments to the GPU.

Yes, A, x, and y are GPU pointers. That was why I had asked you about only supporting extern ForeignFns once the follow-up PR lands. The easiest solution for now is indeed to move this to raw pointers, until either of the follow-up PRs which make it safe has landed.
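
For illustration, the raw-pointer variant of the doc example's wrapper might look roughly like this (sgemv_ffi is a placeholder for the real rocBLAS call; the exact shape is still open):

// Hypothetical raw-pointer variant of the wrapper: it never materializes
// references to device memory and only forwards the pointers.
unsafe extern "C" {
    fn sgemv_ffi(a: *const f32, x: *const f32, y: *mut f64);
}

fn sgemv_wrapper(a: *const f32, x: *const f32, y: *mut f64) {
    // a, x, and y point to device memory; they must not be dereferenced here.
    unsafe { sgemv_ffi(a, x, y) }
}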

Member

The intrinsic is unsafe, so it's entirely fine if some uses of it cause UB. But the docs for the intrinsic should be clear about what exactly is or isn't UB, or if that is yet to be determined then they should be clear about that.

Also, the example shouldn't be UB. :)
