@s-oboyle

Update the complex asinh function to avoid numerical issues.

The current complex asinh function loses accuracy in several places.
These mostly relate to over/underflow, and catastrophic cancellation for tough values.

This new version fixes these accuracy issues while retaining its performance.
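
As a rough illustration of the over/underflow problems described above, here is a real-valued analogue. It is not the complex algorithm in this PR: it only shows how the textbook formula asinh(x) = log(x + sqrt(x^2 + 1)) breaks at both ends of the range. `naive_asinh` and `naive_asinh_rel_error` are hypothetical names used for this sketch, and `std::asinh` is assumed to be an accurate reference.

```cpp
#include <cassert>
#include <cmath>

// Textbook formula evaluated directly. Hypothetical helper for illustration
// only; this is NOT the implementation in this PR.
inline double naive_asinh(double x)
{
  return std::log(x + std::sqrt(x * x + 1.0));
}

// Relative error of the naive formula against the standard library's asinh.
// Only meaningful for x > 0, where the reference value is nonzero.
inline double naive_asinh_rel_error(double x)
{
  assert(x > 0.0);
  const double ref = std::asinh(x);
  return std::fabs(naive_asinh(x) - ref) / std::fabs(ref);
}
```

With x = 1e200, `x * x` overflows to infinity, so the naive result is infinite although the true value is about 461.2. With x = 1e-20, x is absorbed when added to 1, so the naive result is exactly 0 and the relative error is 1.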

Perf

On GH100 there is not much perf difference. (There used to be a much larger perf gap until #5371 got merged, which the current version takes advantage of.)
Using the math team's standard math_bench test, we get the following:

Operations/SM/cycle for casinh():

H100    old       new       new/old
fp64    0.2531    0.2549    1.01
fp32    0.6072    0.6334    1.04

Correctness

The current version has several intervals where accuracy is lost.
Apart from the usual over/underflow suspects, there are also some very subtle intervals where accuracy is badly degraded, especially by catastrophic cancellation very close to ±i.
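
To make the cancellation near ±i concrete: for z = iy, the textbook formula needs 1 + z^2 = 1 - y^2, which cancels badly as y approaches 1. Below is a minimal sketch in real arithmetic, assuming IEEE-754 binary64 `double`. Both function names are hypothetical, and the factored form is a standard remedy shown for comparison, not necessarily what this PR does.

```cpp
#include <cmath>

// 1 - y*y evaluated directly. The product is rounded before the subtraction,
// so its low-order bits are lost when y is close to 1. The volatile store
// keeps the compiler from fusing the two operations into a single FMA.
inline double one_minus_y2_naive(double y)
{
  volatile double yy = y * y;
  return 1.0 - yy;
}

// Same quantity written as (1 - y) * (1 + y). For y in [0.5, 2] the
// difference 1 - y is exact (Sterbenz lemma), so only the sum and the
// product can introduce rounding error.
inline double one_minus_y2_factored(double y)
{
  return (1.0 - y) * (1.0 + y);
}
```

For y = 1 - 2^-30 the exact value is 2^-29 - 2^-60. The factored form returns exactly that, while the naive form returns 2^-29, a relative error of about 4.7e-10, which is millions of ulps in fp64.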

This new version fixes these and testing gives the following:

GPU Correctness

For the new version, an intensive bracket and bisect search, along with testing special hard values, gives:

GPU fp64:
Max ulp real error (4.867,1.742) @ (0.007757045272,-0.0002247045536)    (0x3f7fc5d9fc1f5662,0xbf2d73d56affd72d)
        Ours = (0.007756967678,-0.0002246977954)    Ref = (0.007756967678,-0.0002246977954)
        Ours = (0x3f7fc5c527dd1d58,0xbf2d739b5d7a3961)               Ref = (0x3f7fc5c527dd1d53,0xbf2d739b5d7a3963)

Max ulp imag error (0.1719,5.453) @ (7.198570162e+103,-5.623976789e+101)        (0x558011effdb5a3ad,0xd510120190e898df)
        Ours = (239.8333247,-0.007812471425)    Ref = (239.8333247,-0.007812471425)
        Ours = (0x406dfaaa988c99ba,0xbf7ffff854599f01)               Ref = (0x406dfaaa988c99ba,0xbf7ffff854599efc)
GPU fp32:
Max ulp real error (6.619,2.232) @ (0.007812378462,-0.0007928675623)    (0x3bfffefb,0xba4fd871)
        Ours = (0.007812298369,-0.0007928435807)    Ref = (0.007812301628,-0.0007928434643)
        Ours = (0x3bfffe4f,0xba4fd6d5)               Ref = (0x3bfffe56,0xba4fd6d3)

Max ulp imag error (3.732,5.528) @ (0.007806597743,3.029832988e-05)     (0x3bffce7d,0x37fe292b)
        Ours = (0.007806516718,3.029741674e-05)    Ref = (0.007806518581,3.029740583e-05)
        Ours = (0x3bffcdcf,0x37fe2735)               Ref = (0x3bffcdd3,0x37fe272f)

CPU Correctness

CPU fp64:
Max ulp real error (4.125,0) @ (0.01542893159,-0)       (0x3f8f993424afec00,0x8000000000000000)
        Ours = (0.01542831951,-0)    Ref = (0.01542831951,-0)
        Ours = (0x3f8f98e1fdb37251,0x8000000000000000)               Ref = (0x3f8f98e1fdb3724d,0x8000000000000000)

Max ulp imag error (0.5078,3.484) @ (0.8869326854,-1.12001698e-254)     (0x3fec61c0a7b18800,0x8b3505783ad41800)
        Ours = (0.7991224705,-8.379245502e-255)    Ref = (0.7991224705,-8.379245502e-255)
        Ours = (0x3fe99269498d3a37,0x8b2f742347588681)               Ref = (0x3fe99269498d3a36,0x8b2f742347588684)
CPU fp32:
Max ulp real error (4.827,0.5125) @ (0.001131535857,0.9893865585)       (0x3a94500b,0x3f7d4870)
        Ours = (0.007776218932,1.424767017)    Ref = (0.007776216604,1.424766898)
        Ours = (0x3bfecfa7,0x3fb65ec4)               Ref = (0x3bfecfa2,0x3fb65ec3)

Max ulp imag error (0.498,4.695) @ (-5.695841894e+11,3.510264218e+10)   (0xd3049ddd,0x5102c47d)
        Ours = (-27.76321411,0.06155067682)    Ref = (-27.76321411,0.06155069545)
        Ours = (0xc1de1b10,0x3d7c1c90)               Ref = (0xc1de1b10,0x3d7c1c95)
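
For readers who want to sanity-check the hex dumps above, the following sketch counts how many representable fp32 values lie between two results. It is a coarse check only: the ulp errors reported above are fractional, presumably because they are measured against a higher-precision reference, whereas this integer distance compares against the printed (already rounded) reference. All function names are hypothetical, and a 32-bit IEEE-754 `float` is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float assumed");

// Reinterpret a 32-bit pattern as a float (memcpy avoids aliasing issues).
inline float float_from_bits(std::uint32_t bits)
{
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Map the sign-magnitude encoding onto a monotonic integer line, so that
// adjacent floats differ by exactly 1 and +0 and -0 both map to 0.
inline std::int64_t ordered_key(float x)
{
  std::uint32_t u;
  std::memcpy(&u, &x, sizeof u);
  const std::int64_t magnitude = u & 0x7fffffffu;
  return (u & 0x80000000u) ? -magnitude : magnitude;
}

// Number of representable floats between a and b. Not meaningful for NaN.
inline std::int64_t ulp_distance(float a, float b)
{
  assert(a == a && b == b); // NaN inputs are not supported
  const std::int64_t d = ordered_key(a) - ordered_key(b);
  return d < 0 ? -d : d;
}
```

Applied to the GPU fp32 "Max ulp real error" entry, `0x3bfffe4f` versus `0x3bfffe56` gives a distance of 7 for the real part and `0xba4fd6d5` versus `0xba4fd6d3` gives 2 for the imaginary part, in line with the reported (6.619, 2.232).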

@s-oboyle s-oboyle requested a review from a team as a code owner October 31, 2025 20:57
@s-oboyle s-oboyle requested a review from pciolkosz October 31, 2025 20:57
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 31, 2025
@copy-pr-bot

copy-pr-bot bot commented Oct 31, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Oct 31, 2025

@davebayer davebayer left a comment

Some small things

// An unsafe sqrt(_Tp + _Tp) extended precision sqrt.
template <typename _Tp>
static void __device__ __host__ __forceinline__
__internal_double_Tp_sqrt_unsafe(_Tp __hi, _Tp __lo, _Tp* __out_hi, _Tp* __out_lo) noexcept

Nitpick: I would strongly prefer that we return a simple struct here instead of using inout parameters

template <class _Tp>
struct _CCCL_ALIGNAS(2 * sizeof(_Tp)) __cccl_sqrt_return_hilo {
  _Tp __hi;
  _Tp __low;
};
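
A host-only sketch of what the struct-returning shape could look like. Everything here is an assumption made for illustration: the names `sqrt_hilo` and `double_word_sqrt_unsafe` are hypothetical, the CCCL macros are left out so the snippet compiles as plain C++, and the arithmetic is a generic FMA-based first-order correction rather than the code in this PR.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: the square root as an unevaluated sum hi + lo.
template <class T>
struct sqrt_hilo
{
  T hi;
  T lo;
};

// "Unsafe" sketch of sqrt(in_hi + in_lo): zero, negative, infinite and NaN
// inputs are not handled, and the result is not renormalized.
template <class T>
sqrt_hilo<T> double_word_sqrt_unsafe(T in_hi, T in_lo)
{
  assert(in_hi > T(0)); // precondition of this sketch
  const T s = std::sqrt(in_hi);
  // fma computes in_hi - s*s with a single rounding; then add the low word.
  const T residual = std::fma(-s, s, in_hi) + in_lo;
  // First-order (Newton) correction: sqrt(a) ~= s + (a - s*s) / (2*s).
  return {s, residual / (T(2) * s)};
}
```

A caller then writes `const auto r = double_word_sqrt_unsafe(hi, lo);` and reads `r.hi` and `r.lo`, with no pointers to set up and nothing to alias.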

Contributor Author

done

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Nov 3, 2025
s-oboyle and others added 7 commits November 3, 2025 16:23
Applied suggested constant initializers.

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: David Bayer <[email protected]>
…ions.h


Replace __device__ __host__

Co-authored-by: Michael Schellenberger Costa <[email protected]>
…ions.h


Add guards for non-cuda compilers

Co-authored-by: Michael Schellenberger Costa <[email protected]>
…ions.h


undo-ing clang-format

Co-authored-by: David Bayer <[email protected]>
Comment on lines 24 to 36
#include <cuda/std/__cmath/abs.h>
#include <cuda/std/__cmath/copysign.h>
#include <cuda/std/__cmath/isinf.h>
#include <cuda/std/__cmath/isnan.h>
#include <cuda/std/__cmath/trigonometric_functions.h>
#include <cuda/std/__complex/complex.h>
#include <cuda/std/__complex/exponential_functions.h>
#include <cuda/std/__complex/logarithms.h>
#include <cuda/std/__complex/nvbf16.h>
#include <cuda/std/__complex/nvfp16.h>
#include <cuda/std/__complex/roots.h>
#include <cuda/std/limits>
#include <cuda/std/numbers>

Why are all of these includes dropped?

Contributor Author

@s-oboyle s-oboyle Nov 3, 2025

My prejudice, basically: in math we get compile-time build bugs because of header parsing, so I usually try to take headers out if they're not needed. I was overzealous here, it seems, and I'm not sure why this even built on my machine. I've added them back, as I would probably have had to do later anyway after changing the other functions.

@s-oboyle
Contributor Author

s-oboyle commented Nov 3, 2025

/ok to test 9629a78

@github-actions

github-actions bot commented Nov 3, 2025

😬 CI Workflow Results

🟥 Finished in 2h 13m: Pass: 80%/90 | Total: 1d 06h | Max: 1h 32m | Hits: 94%/152135

See results here.
