Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enzyme failed to support atomicAdd, atomicCAS, and assert for cuda code #2053

Open
minansys opened this issue Aug 27, 2024 · 8 comments
Open

Comments

@minansys
Copy link

minansys commented Aug 27, 2024

I am using enzyme for gpu code. Here is the some information

clang: 15.0
nvcc: cuda 11.8
WSL linux: Ubuntu 20.04 LTS, 
Enzyme:
  commit 2fe5164a2423dd67ef25e2c4fb204fd06362494b (HEAD -> main, origin/main, origin/HEAD)
  Author: William Moses <[email protected]>
  Date:   Thu Aug 15 23:00:59 2024 -0500
Build:
  clang++ -I/usr/local/cuda-11.8/include example.cu -o example -O2  --cuda-path=/usr/local/cuda-11.8/ --cuda-gpu-arch=sm_86 -L/usr/local/cuda-11.8/lib64 -lcudart -v -DENABLE_ENZYME -fplugin=/home/mixu/software/Enzyme/enzyme/build/Enzyme/ClangEnzyme-15.so -mllvm -enzyme-loose-types=1
#include <stdio.h>
#include <assert.h>

#define enzyme_device_func __device__

extern int enzyme_device_func enzyme_dup;
extern int enzyme_device_func enzyme_dupnoneed;
extern int enzyme_device_func enzyme_out;
extern int enzyme_device_func enzyme_const;

template < typename return_type, typename ... T >
enzyme_device_func return_type __enzyme_fwddiff(void*, T ... );

template < typename return_type, typename ... T >
enzyme_device_func return_type __enzyme_autodiff(void*, T ... );

#define dev std
#define Real double
//#define FORWARD

// AtomicAdd -----------------------------

// #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
// __device__ static double atomicAdd(double* address, double val)
// {
// 	unsigned long long int* address_as_ull = (unsigned long long int*)address;
// 	unsigned long long int old = *address_as_ull, assumed;
// 	do
// 	{
// 		assumed = old;
// 		old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
// 		// Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
// 	}
// 	while (assumed != old);
// 	return __longlong_as_double(old);
// }
// #endif

template <class T>
__device__ static T AtomicAdd(T& dest, const T& val)
{
    return atomicAdd(&dest, val);
}

// AtomicMax -----------------------------		
__device__ static double atomicMax(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do
    {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(dev::fmax(val, __longlong_as_double(assumed))));
        // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    }
    while (assumed != old);
    return old;
}

template <class T>
__device__ static T AtomicMax(T& dest, const T& val)
{
    return atomicMax(&dest, val);
}

void __device__ square_impl(Real* x_in, Real *x_out) {
    //assert(x_in[0] == 0.0);
    //x_out[0] = x_in[0] * x_in[0];
    AtomicMax(x_out[0], x_in[0]);
    //AtomicAdd(x_out[0], x_in[0]);
}

typedef void (*f_ptr)(Real*, Real*);

void __global__ square(Real* x_in, Real *x_out) {
    square_impl(x_in, x_out);
}

void __global__ square_grad(Real* x, Real *d_x, Real *y, Real *d_y) {
    __enzyme_autodiff<void>((void *)(square_impl),
        enzyme_dup, x, d_x,
        enzyme_dup, y, d_y);
}

int main() {

    // Device pointers
    Real *x, *d_x, *y, *d_y;

    // Allocate GPU device memory
    cudaMalloc(&x, sizeof(*x));
    cudaMalloc(&d_x, sizeof(*d_x));
    cudaMalloc(&y, sizeof(*y));
    cudaMalloc(&d_y, sizeof(*d_y));

    // Initialize device values
    Real host_x = 1.4;
    Real host_d_x = 0.0;
    Real host_y = 1.0;
    Real host_d_y = 1.0;

    // Copy data to device
    cudaMemcpy(x, &host_x, sizeof(*x), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, &host_d_x, sizeof(*d_x), cudaMemcpyHostToDevice);
    cudaMemcpy(y, &host_y, sizeof(*y), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, &host_d_y, sizeof(*d_y), cudaMemcpyHostToDevice);

#ifdef FORWARD
    // Forward pass only
    square<<<1, 1>>>(x, y);
#else
    // Forward and backward pass
    //square_grad<<<1, 1>>>(x, d_x, y, d_y);  
#endif

    // Synchronize device memory
    cudaDeviceSynchronize();

    // Copy data from device to host
    cudaMemcpy(&host_x, x, sizeof(*x), cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_d_x, d_x, sizeof(*d_x), cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_y, y, sizeof(*y), cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_d_y, d_y, sizeof(*d_y), cudaMemcpyDeviceToHost);

    // Print results
    printf("%f %f\n", host_x, host_y);
    printf("%f %f\n", host_d_x, host_d_y);
    
}

in square_impl if I use AtomicMax, it failed to compile with error msg:


example.cu:65:17: error: Enzyme: <analysis>
  %6 = phi i64 [ %4, %2 ], [ %11, %5 ]: {[-1]:Float@double}, intvals: {}
  %10 = cmpxchg ptr %1, i64 %6, i64 %9 seq_cst seq_cst, align 8: {}, intvals: {}
  %9 = bitcast double %8 to i64: {}, intvals: {}
  %8 = tail call contract double @__nv_fmax(double noundef %3, double noundef %7) #149: {[-1]:Float@double}, intvals: {}
ptr %0: {[-1]:Pointer, [-1,0]:Float@double}, intvals: {}
ptr %1: {[-1]:Pointer, [-1,0]:Integer, [-1,1]:Integer, [-1,2]:Integer, [-1,3]:Integer, [-1,4]:Integer, [-1,5]:Integer, [-1,6]:Integer, [-1,7]:Integer}, intvals: {}
  %12 = icmp eq i64 %6, %11: {[-1]:Integer}, intvals: {}
  %11 = extractvalue { i64, i1 } %10, 0: {}, intvals: {}
  %7 = bitcast i64 %6 to double: {[-1]:Float@double}, intvals: {}
  %4 = load i64, ptr %1, align 8, !tbaa !12: {[-1]:Integer}, intvals: {}
  %3 = load double, ptr %0, align 8, !tbaa !8: {[-1]:Float@double}, intvals: {}
</analysis>
Illegal updateAnalysis prev:{[-1]:Integer} new: {[-1]:Float@double}
val:   %4 = load i64, ptr %1, align 8, !tbaa !12 origin=  %6 = phi i64 [ %4, %2 ], [ %11, %5 ]
void __device__ square_impl(Real* x_in, Real *x_out) {
                ^

in square_impl if I use AtomicAdd, it fails to compile with the error msg:

fatal error: error in backend: Cannot select: 0x558d7c69bb10: f64,ch = AtomicLoad<(load seq_cst (s64) from %ir.5, addrspace 1)> 0x558d7c39d080:1, 0x558d7c69da30
  0x558d7c69da30: i64 = addrspacecast[0 -> 1] 0x558d7c463798
    0x558d7c463798: i64,ch = load<(dereferenceable invariant load (s64) from `ptr addrspace(101) null`, addrspace 101)> 0x558d7c432ca8, TargetExternalSymbol:i64'_Z11square_gradPdS_S_S__param_3', undef:i64
      0x558d7c69b1b8: i64 = TargetExternalSymbol'_Z11square_gradPdS_S_S__param_3'
      0x558d7c4635f8: i64 = undef
In function: _Z11square_gradPdS_S_S_
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Ubuntu clang version 15.0.7
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
clang: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /tmp/example-d21287.cu
clang: note: diagnostic msg: /tmp/example-01fc49/example-sm_86.cu
clang: note: diagnostic msg: /tmp/example-d21287.sh
clang: note: diagnostic msg: 

********************

in square_impl if I use assert(x_out[0] == 0.0);, it fails to compile with the error msg:

error: Enzyme: No create nofree of empty function (__assertfail) __assertfail)
 at context:   tail call void @__assertfail(ptr noundef nonnull @.str, ptr noundef nonnull @.str1, i32 noundef 67, ptr noundef nonnull @__PRETTY_FUNCTION__._Z11square_implPdS_, i64 noundef 1) #149 (__assertfail)
1 error generated when compiling for sm_86.

The forward codes comply and run fine with the __enzyme_autodiff function commented out.

@minansys minansys changed the title Enzyme failed to support atomicAdd and atomicCAS for cuda code Enzyme failed to support atomicAdd, atomicCAS, and assert for cuda code Aug 28, 2024
@minansys
Copy link
Author

Hi @wsmoses Thank you for your assistance. With the latest build, the assert issue has been resolved. However, CUDA does not provide an atomicMax function out of the box, so we need to use atomicCAS to achieve that functionality. Unfortunately, it seems that atomicCAS is not supported by Enzyme. Additionally, atomicAdd does not support double precision before SM_60, which also necessitates the use of atomicCAS. Even for float precision, I have been unable to compile atomicAdd with Enzyme. Any tips or workarounds for these issues would be greatly appreciated.

@minansys
Copy link
Author

minansys commented Oct 21, 2024

I tried to use custom gradient to bypass this issues. However the CPU code did not work

#include <stdio.h>

extern int enzyme_dup;
extern int enzyme_out;

template <typename return_type, typename... T>
return_type __enzyme_autodiff(void*, T...);

#define Real double

// CPU version of AtomicAdd (simple addition)
__attribute__((noinline))
void AtomicAdd(double* dest, const double* val) {
    *dest += *val;
}

// Augmented forward function for AtomicAdd (for Enzyme differentiation)
void augment_AtomicAdd(double* dest, double* d_dest, const double* val, const double* d_val) {
    // Perform the forward pass operation (addition) and mark changes for differentiation
    AtomicAdd(dest, val);
    *d_dest += *d_val;  // Update the gradient for dest
}

// Reverse function for AtomicAdd (for Enzyme differentiation)
void reverse_AtomicAdd(const double* dest, const double* d_dest, const double* val, double* d_val, void* tape) {
    //printf("reverse_AtomicAdd called\n");
    *d_val += (*d_dest);  // Compute gradient for val from reverse pass
}

// Register the custom gradient functions with Enzyme
void* __enzyme_register_gradient_AtomicAdd[] = {
    (void*)AtomicAdd,
    (void*)augment_AtomicAdd,
    (void*)reverse_AtomicAdd,
};

// Simple square operation (CPU version)
void square_impl(Real* x_in, Real* x_out) {
    AtomicAdd(x_out, x_in);
}

// Forward and backward pass for square (CPU version)
void square_grad(Real* x, Real* d_x, Real* y, Real* d_y) {
    __enzyme_autodiff<void>((void*)(square_impl),
                            enzyme_dup, x, d_x,
                            enzyme_dup, y, d_y);
}

int main() {    
    Real x = 1.4;
    Real d_x = 0.0;
    Real y = 1.0;
    Real d_y = 1.0;

    // Forward and backward pass (with differentiation)
    square_grad(&x, &d_x, &y, &d_y);

    // Print results after the forward and backward pass
    printf("x: %f, y: %f\n", x, y);
    printf("d_x: %f, d_y: %f\n", d_x, d_y);

    return 0;
}

it crashed with line 21: 381994 Segmentation fault (core dumped) ./a.out

with compile options clang++ -DNDEBUG -g -O0 -fplugin=/home/mixu/software/Enzyme/enzyme/build/Enzyme/ClangEnzyme-15.so example_cpu_custom.cpp

@minansys
Copy link
Author

minansys commented Oct 21, 2024

The gpu version of the custom gradient does not even visit the registered custom gradient

#include <stdio.h>
#include <assert.h>

#define enzyme_device_func __device__

extern int enzyme_device_func enzyme_dup;
extern int enzyme_device_func enzyme_dupnoneed;
extern int enzyme_device_func enzyme_out;
extern int enzyme_device_func enzyme_const;

template < typename return_type, typename ... T >
enzyme_device_func return_type __enzyme_autodiff(void*, T ... );

#define dev std
#define Real double
//#define FORWARD

__device__ void AtomicAdd(double* dest, const double* val)
{
    //atomicAdd(&dest, val);
    *dest += *val;
}

// Augmented forward function for AtomicAdd
__device__ void augment_AtomicAdd(double* dest, double* d_dest, const double* val, const double* d_val) {
    // Save the old value of dest in tape
    AtomicAdd(dest, val);
    *d_dest += *d_val;
}

// Reverse function for AtomicAdd
__device__ void reverse_AtomicAdd(double* dest, const double* d_dest, const double* val, double* d_val, void** tape) {
    printf("reverse_AtomicAdd\n");
    *d_val += (*d_dest);
}

// Register gradient functions in host code
void* __enzyme_register_gradient_AtomicAdd[3] = { 
    (void*)AtomicAdd,             // Original atomic add
    (void*)augment_AtomicAdd,     // Augmented forward
    (void*)reverse_AtomicAdd      // Reverse pass
};
void __device__ square_impl(Real* x_in, Real *x_out) {
    AtomicAdd(&(x_out[0]), &(x_in[0]));
}

typedef void (*f_ptr)(Real*, Real*);

void __global__ square(Real* x_in, Real *x_out) {
    square_impl(x_in, x_out);
}

void __global__ square_grad(Real* x, Real *d_x, Real *y, Real *d_y) {
    __enzyme_autodiff<void>((void *)(square_impl),
        enzyme_dup, x, d_x,
        enzyme_dup, y, d_y);
}

int main() {

    // Device pointers
    Real *x, *d_x, *y, *d_y;

    // Allocate GPU device memory
    cudaMalloc(&x, sizeof(*x));
    cudaMalloc(&d_x, sizeof(*d_x));
    cudaMalloc(&y, sizeof(*y));
    cudaMalloc(&d_y, sizeof(*d_y));

    // Initialize device values
    Real host_x = 1.4;
    Real host_d_x = 0.0;
    Real host_y = 1.0;
    Real host_d_y = 1.0;

    // Copy data to device
    cudaMemcpy(x, &host_x, sizeof(*x), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, &host_d_x, sizeof(*d_x), cudaMemcpyHostToDevice);
    cudaMemcpy(y, &host_y, sizeof(*y), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, &host_d_y, sizeof(*d_y), cudaMemcpyHostToDevice);

#ifdef FORWARD
    // Forward pass only
    square<<<1, 1>>>(x, y);
#else
    // Forward and backward pass
    square_grad<<<1, 1>>>(x, d_x, y, d_y);  
#endif

    // Synchronize device memory
    cudaDeviceSynchronize();

    // Copy data from device to host
    cudaMemcpy(&host_x, x, sizeof(*x), cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_d_x, d_x, sizeof(*d_x), cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_y, y, sizeof(*y), cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_d_y, d_y, sizeof(*d_y), cudaMemcpyDeviceToHost);

    // Print results
    printf("%f %f\n", host_x, host_y);
    printf("%f %f\n", host_d_x, host_d_y);
    
}

with compile option

  clang++ -I/usr/local/cuda-11.8/include example.cu -o example -O2  --cuda-path=/usr/local/cuda-11.8/ --cuda-gpu-arch=sm_86 -L/usr/local/cuda-11.8/lib64 -lcudart -v -DENABLE_ENZYME -fplugin=/home/mixu/software/Enzyme/enzyme/build/Enzyme/ClangEnzyme-15.so -mllvm -enzyme-loose-types=1

@minansys
Copy link
Author

Hi @wsmoses, do you have any suggestions about the custom gradients as shown above? Is there anything wrong with my code? Thanks in advance for your help!

@minansys
Copy link
Author

minansys commented Oct 28, 2024

@wsmoses Thanks for your suggestion. Now the custom gradient for the CPU works fine https://fwd.gymni.ch/eqgSVV. However, the GPU version did not give the expected gradient https://fwd.gymni.ch/vpgh8Y. It seems the customer gradient is not used at all due to it is not on the device code. However If I have __device__ void* __enzyme_register_gradient_AtomicAdd[] =,

then it will complains

__enzyme_register_gradient must be a constant aggregate @__enzyme_register_gradient_AtomicAdd = internal global [3 x ptr] undef, align 16

Can you provide any suggestions? Thanks!

@wsmoses
Copy link
Member

wsmoses commented Nov 25, 2024

Image

@minansys
Copy link
Author

minansys commented Nov 25, 2024

For the reference, the original atomicAdd issue is likely a LLVM issues. Billy has reported a bug for this llvm/llvm-project#117606

@wsmoses for customer gradient issue https://fwd.gymni.ch/OBA2rq,

  1. here is the log for __host__ __device__ void* __enzyme_register_gradient_AtomicAdd[] =
    out_host_device.log

  2. here is the log for __device__ void* __enzyme_register_gradient_AtomicAdd[] =
    out_device.log

Both end up with Use of __enzyme_register_gradient must be a constant aggregate @__enzyme_register_gradient_AtomicAdd = internal global [3 x ptr] undef, align 16

@minansys
Copy link
Author

minansys commented Dec 2, 2024

@wsmoses After building the LLVM using the main branch, and rebuilding the enzyme using the latest LLVM, the atomic add issue is resolved now. Thanks a lot for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants