Top: upcoming MPM engine that runs on CPU and GPU using rust-cuda, Bottom: toy path tracer that can run on CPU, GPU, and GPU (hardware raytracing) using recent experiments with OptiX
Today marks an exciting milestone for the Rust CUDA Project. Over the past couple of months, we have made significant advancements in supporting many of the fundamental CUDA ecosystem libraries. The main changes in this release are the changes to cust that make future library support possible, but we will also highlight some of the WIP experiments we have been conducting.
Cust changes
This release is likely the biggest and most breaking change to cust ever; we had to fundamentally rework how many things work to:
- Fix some unsoundness.
- Remove some outdated and inconsistent things.
- Rework how contexts work to be interoperable with the runtime API.
This release is therefore guaranteed to break your code. However, the changes should not break too much unless you did a lot of lower-level work with device memory constructs.
Cust 0.3 changes
TLDR
This release is gigantic, so here are the main things you need to worry about:
- `Context::create_and_push(FLAGS, device)` -> `Context::new(device)`.
- `Module::from_str(PTX)` -> `Module::from_ptx(PTX, &[])`.
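As a minimal migration sketch (assuming a hypothetical `foo.ptx` kernel file; this will not run without a CUDA-capable GPU), the old and new calls look like:

```rust
use cust::context::Context;
use cust::device::Device;
use cust::module::Module;

fn main() -> cust::error::CudaResult<()> {
    cust::init(cust::CudaFlags::empty())?;
    let device = Device::get_device(0)?;

    // 0.2: Context::create_and_push(ContextFlags::MAP_HOST, device)?
    let _ctx = Context::new(device)?;

    // 0.2: Module::from_str(PTX)?
    // `&[]` is the (empty) list of JIT options.
    let _module = Module::from_ptx(include_str!("../foo.ptx"), &[])?;
    Ok(())
}
```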
Context handling overhaul
The way contexts are handled in cust has been completely overhauled: it now uses primary context handling instead of the normal driver API context APIs. This is aimed at future-proofing cust for libraries such as cuBLAS and cuFFT, as well as simplifying the context handling APIs overall. This does mean that the API has changed a bit:
- `create_and_push` is now `new`, and it only takes a device, not a device and flags.
- `set_flags` is now used for setting context flags.
- `ContextStack`, `UnownedContext`, and other legacy APIs are gone.
The old context handling is fully present in `cust::context::legacy` for anyone who needs it for specific reasons. If you use `quick_init` you don't need to worry about any breaking changes, the API is the same.
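A sketch of the new flow (flags are now set after creation rather than at creation time; the exact `set_flags` signature is an assumption based on the changes above):

```rust
use cust::context::{Context, ContextFlags};
use cust::device::Device;

fn make_context() -> cust::error::CudaResult<Context> {
    cust::init(cust::CudaFlags::empty())?;
    let device = Device::get_device(0)?;
    // Primary-context based: no flags at creation time...
    let ctx = Context::new(device)?;
    // ...flags are applied afterwards with `set_flags`.
    ctx.set_flags(ContextFlags::SCHED_AUTO)?;
    Ok(ctx)
}
```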
cust_core
`DeviceCopy` has now been split into its own crate, `cust_core`. The crate is `#![no_std]`, which allows you to pull in `cust_core` in GPU crates for deriving `DeviceCopy` without cfg shenanigans.
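For instance, a type shared between host and kernel crates can now derive `DeviceCopy` from `cust_core` alone (a sketch; the struct and its fields are illustrative):

```rust
// In a #![no_std] GPU crate: only `cust_core` is needed, no cfg gymnastics.
use cust_core::DeviceCopy;

#[derive(Clone, Copy, DeviceCopy)]
#[repr(C)]
pub struct Particle {
    pub position: [f32; 3],
    pub velocity: [f32; 3],
}
```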
Removed
- `DeviceBox::wrap`, use `DeviceBox::from_raw`.
- `DeviceSlice::as_ptr` and `DeviceSlice::as_mut_ptr`. Use `DeviceSlice::as_device_ptr` then `DevicePointer::as_(mut)_ptr`.
- `DeviceSlice::chunks` and consequently `DeviceChunks`.
- `DeviceSlice::chunks_mut` and consequently `DeviceChunksMut`.
- `DeviceSlice::from_slice` and `DeviceSlice::from_slice_mut` because it was unsound.
- `DevicePointer::as_raw_mut` (use `DevicePointer::as_mut_ptr`).
- `DevicePointer::wrap` (use `DevicePointer::from_raw`).
- `DeviceSlice` no longer implements `Index` and `IndexMut`; switching away from `[T]` made this impossible to implement. Instead you can now use `DeviceSlice::index`, which behaves the same.
- `vek` is no longer re-exported.
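Since `DeviceSlice` no longer implements `Index`, subslicing goes through `DeviceSlice::index` (a sketch, assuming a buffer of `f32` values; it will not run without a GPU):

```rust
use cust::memory::DeviceBuffer;

fn sub_slice(buf: &DeviceBuffer<f32>) {
    // 0.2: let front = &buf[0..4];
    // 0.3: `index` takes a range and returns a `DeviceSlice<f32>` by value
    // (`DeviceSlice` is now `Copy`).
    let front = buf.index(0..4);
    assert_eq!(front.len(), 4);
}
```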
Deprecated
- `Module::from_str`, use `Module::from_ptx` and pass `&[]` for options.
- `Module::load_from_string`, use `Module::from_ptx_cstr`.
Added
- `cust::memory::LockedBox`, same as `LockedBuffer` except for single elements.
- `cust::memory::cuda_malloc_async`.
- `cust::memory::cuda_free_async`.
- `impl AsyncCopyDestination<LockedBox<T>> for DeviceBox<T>` for async HtoD/DtoH memcpy.
- `DeviceBox::new_async`.
- `DeviceBox::drop_async`.
- `DeviceBox::zeroed_async`.
- `DeviceBox::uninitialized_async`.
- `DeviceBuffer::uninitialized_async`.
- `DeviceBuffer::drop_async`.
- `DeviceBuffer::zeroed`.
- `DeviceBuffer::zeroed_async`.
- `DeviceBuffer::cast`.
- `DeviceBuffer::try_cast`.
- `DeviceSlice::set_8` and `DeviceSlice::set_8_async`.
- `DeviceSlice::set_16` and `DeviceSlice::set_16_async`.
- `DeviceSlice::set_32` and `DeviceSlice::set_32_async`.
- `DeviceSlice::set_zero` and `DeviceSlice::set_zero_async`.
- the `bytemuck` feature, which is enabled by default.
- mint integration behind `impl_mint`.
- half integration behind `impl_half`.
- glam integration behind `impl_glam`.
- experimental linux external memory import APIs through `cust::external::ExternalMemory`.
- `DeviceBuffer::as_slice`.
- `DeviceVariable`, a simple wrapper around `DeviceBox<T>` and `T` which allows easy management of a CPU and GPU version of a type.
- `DeviceMemory`, a trait describing any region of GPU memory that can be described with a pointer + a length.
- `memcpy_htod`, a wrapper around `cuMemcpyHtoD_v2`.
- `mem_get_info` to query the amount of free and total memory.
- `DevicePointer::as_ptr` and `DevicePointer::as_mut_ptr` for `*const T` and `*mut T`.
- `DevicePointer::from_raw` for `CUdeviceptr -> DevicePointer<T>` with a safe function.
- `DevicePointer::cast`.
- dependency on `cust_core` for `DeviceCopy`.
- `ModuleJitOption`, `JitFallback`, `JitTarget`, and `OptLevel` for specifying options when loading a module. Note that `ModuleJitOption::MaxRegisters` does not seem to work currently, but NVIDIA is looking into it. You can achieve the same goal by compiling the PTX to cubin using nvcc, then loading that: `nvcc --cubin foo.ptx -maxrregcount=REGS`
- `Module::from_fatbin`.
- `Module::from_cubin`.
- `Module::from_ptx` and `Module::from_ptx_cstr`.
- `Stream`, `Module`, `Linker`, `Function`, `Event`, `UnifiedBox`, `ArrayObject`, `LockedBuffer`, `LockedBox`, `DeviceSlice`, `DeviceBuffer`, and `DeviceBox` all now impl `Send` and `Sync`; this makes it much easier to write multigpu code. The CUDA API is fully thread-safe except for graph objects.
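With those handles now `Send`/`Sync`, a per-device worker-thread sketch becomes straightforward (illustrative only; the kernel launches are omitted and this will not run without CUDA hardware):

```rust
use std::thread;

use cust::context::Context;
use cust::device::Device;

fn main() -> cust::error::CudaResult<()> {
    cust::init(cust::CudaFlags::empty())?;
    // Spawn one worker thread per device, each owning its primary context.
    let handles: Vec<_> = (0..Device::num_devices()?)
        .map(|id| {
            thread::spawn(move || -> cust::error::CudaResult<()> {
                let device = Device::get_device(id)?;
                let _ctx = Context::new(device)?;
                // ... allocate buffers, launch kernels on this device ...
                Ok(())
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap()?;
    }
    Ok(())
}
```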
Changed
- `zeroed` functions on `DeviceBox` and others are no longer unsafe; instead they now require `T: Zeroable`. The functions are only available with the `bytemuck` feature.
- `Stream::add_callback` now internally uses `cuLaunchHostFunc`, anticipating the deprecation and removal of `cuStreamAddCallback` per the driver docs. This does however mean that the function no longer takes a device status as a parameter and does not execute on context error.
- `Linker::complete` now only returns the built cubin, not the cubin and a duration.
- Features such as `vek` for implementing DeviceCopy are now `impl_cratename`, e.g. `impl_vek`, `impl_half`, etc.
- `num-complex` integration is now behind `impl_num_complex`, not `num-complex`.
- `DeviceBox` now requires `T: DeviceCopy` (previously it didn't, but almost all its methods did).
- `DeviceBox::from_raw` now takes a `CUdeviceptr` instead of a `*mut T`.
- `DeviceBox::as_device_ptr` now requires `&self` instead of `&mut self`.
- `DeviceBuffer` now requires `T: DeviceCopy`.
- `DeviceBuffer` is now `repr(C)` and is represented by a `DevicePointer<T>` and a `usize`.
- `DeviceSlice` now requires `T: DeviceCopy`.
- `DeviceSlice` is now represented as a `DevicePointer<T>` and a `usize` (and is `repr(C)`) instead of `[T]`, which was definitely unsound.
- `DeviceSlice::as_ptr` and `DeviceSlice::as_ptr_mut` now both return a `DevicePointer<T>`.
- `DeviceSlice` is now `Clone` and `Copy`.
- `DevicePointer::as_raw` now returns a `CUdeviceptr`, not a `*const T` (use `DevicePointer::as_ptr`).
- Fixed typo in `CudaError`: `InvalidSouce` is now `InvalidSource`, no more invalid sauce 🍅🥣
Line tables
The libnvvm codegen can now generate line tables while optimizing (previously it could generate debug info, but not optimize), which lets you debug and profile kernels much more effectively in tools like Nsight Compute. You can enable debug info creation using `.debug(DebugInfo::LineTables)` with `cuda_builder`.
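In a build script this might look like the following (a sketch; the crate path and output name are hypothetical):

```rust
use cuda_builder::{CudaBuilder, DebugInfo};

fn main() {
    CudaBuilder::new("kernels")
        // Emit line tables so Nsight Compute can map instructions back to
        // Rust source lines while still optimizing the generated PTX.
        .debug(DebugInfo::LineTables)
        .copy_to("kernels.ptx")
        .build()
        .unwrap();
}
```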
OptiX
Thanks to the generous work of @anderslanglands, we were able to get rust-cuda to target hardware raytracing completely in Rust (both for the host and the device). The toy path tracer example has been ported to use hardware RT as a backend; however, optix and optix_device are not published on crates.io yet, since they are still highly experimental.

using hardware rt to render a simple mesh
cuBLAS
Work has started on supporting cuBLAS through a high-level wrapper library. A lot of work was needed in cust to interoperate with cuBLAS, which is a runtime-API-based library; this required changes to how cust handles contexts so that context resources cuBLAS was using are not dropped. The library is not yet published, but it eventually will be once it is more complete. cuBLAS is a big piece of neural network training on the GPU, so it is critical to support it.
cuDNN
@frjnn has generously been working on wrapping the cuDNN library. cuDNN is the primary tool used to train neural networks on the GPU, and the primary tool used by PyTorch and TensorFlow. High-level bindings to cuDNN are a major step toward making machine learning in Rust a viable option. This work is still very much in progress, so it is not published yet; it will be published once it is usable, and will likely first be used in neuronika for GPU neural network training.
Atomics
Work has started on supporting GPU-side atomics in cuda_std; some preliminary work is already published, but it is still very much in progress and subject to change. Atomics are a difficult problem because of the vast number of options available for GPU atomics, including:
- Different atomic scopes, device, system, or block.
- Specialized instructions or emulated depending on the compute capability target.
- Hardware float atomics (which Rust's core library does not have).
You can read more about it here.


