Exponential function for fast calculation of Gaussians #286

wojdyr · 2023-11-01T23:45:09Z

wojdyr
Nov 1, 2023
Maintainer

Here are my notes on choosing an efficient approximation of exp(x),
which is used in gemmi to calculate a sum of 5 Gaussians.

Our use case

X-ray and electron scattering factors have traditionally been approximated by a sum of (usually 4 or 5) Gaussians (and a constant). International Tables for Crystallography contain tables with Gaussian coefficients for all relevant scatterers. In principle, other approximations are possible and could be more efficient, but we don't investigate alternative approximations here. We only want to optimize calculation of the Gaussian function $a \exp(-b r^2)$ by finding a fast and accurate enough approximation of exp().

The coefficients of Gaussians in the Tables have up to 4 digits after the decimal point. Typically, it's 4-6 significant digits. That sum of Gaussians is an approximation of the real atomic form factor. Waasmaier and Kirfel (1995) compare errors in X-ray scattering approximations for Si. The mean errors of two approximations (from ITfC and W-K) are reported to be 0.002 and 0.001; the maximum errors – 0.009 and 0.008. (I'd need to check if it's relative or absolute error).

A few other N-Gaussian approximations to scattering factors are shown in cctbx news in this newsletter from 2004 (p. 27). The errors depend on the number of Gaussians, but in general it seems that relative errors up to 0.01 are OK.

What absolute accuracy of exp should we aim for? 1e-4 should suffice, but let's try 1e-5, to be on the safe side.

The exponent in Gaussian is always negative or zero. Also, if the exponent is below -15, the addend is negligible:
$a \exp(-15) \approx a \cdot 3.1 \cdot 10^{-7}$, where $a < 40$.

How functions are approximated

To compute a special function efficiently, experts in the approximation theory (and practice) often split the function
into segments and approximate each segment by a polynomial or rational function. They may, for example, use the Remez algorithm to find a Chebyshev polynomial that is a good approximation. They also exploit mathematical properties of the function and use various tricks of the trade.

Since I'm not an expert, I'll review here the existing methods, pick one and only tweak it.

Existing approximations

The exponential function has two useful properties:

$\exp(a+b) = \exp(a) \exp(b)$ ; we could calculate exp(12.3456) as exp(12)×exp(.3456) or exp(12)×exp(.34)×exp(.0056).
$\exp(x) = 2^{x / \ln 2}$ and the computer representation of a floating-point number (sign, mantissa, exponent) explicitly stores $n$ from the $2^n$ factor.

But first, let's start with a plain lookup table. Refmac, the fastest mainstream MX refinement program, has a lookup table with a step of 0.001. It takes ~35KiB to tabulate the [-9,0] interval. For comparison, today's computers usually have 32 or 48 or 64KiB of L1d cache per core. CCTBX (and therefore phenix.refine) can use lookup table as an alternative to std::exp(). At least that's what was reported (p. 28) 20 years ago – the table lookup was found to take 1/3 less time than std::exp().

Lookup tables were more popular in the past. Nowadays, some people claim that you (probably) shouldn't use a lookup table, especially a big one, but it depends.

A very different method, based on bit manipulation, was published (a copy) by Nicol Schraudolph in 1999 (see an example C++ implementation). The exponent is computed using integer operations and then its bits are copied into a floating-point number. It's super fast, but its relative accuracy is only about 3%.

The accuracy can be increased by adjusting the mantissa. This usually involves evaluating a polynomial or using a lookup table. Here are some of the projects, papers and StackOverflow answers that include such adjustments:

jhjourdan/SIMD-math-prims (2015 writeup),
Malossi et al (2015, coefficients not included),
Perini and Reitz (2018),
simonpf/fastexp (2018 writeup),
herumi/fmath (uses SIMD intrinsincs),
reyoung/avx_mathfun – uses SSE2 and AVX intrinsics and is used in PyTorch
the original author, N. Schraudolph, answering on StackOverflow in 2018,
others answering a similar SO question.

These projects are either based on Schraudolph's method or employ a similar bit manipulation technique. Other approaches have been explored as well. For instance, nadavrot/fast_log uses a lookup table + a polynomial, though the benchmark results on that page are not particularly encouraging.

The most tested exp implementations are those in the language standard library. Such libraries aim to return a result within 1 ulp of the exact result; it's not required by the standard, but it's expected from a reputable numerical library. Some libraries compute correctly rounded (within 0.5 ulp) values for all possible arguments.

In LLVM libc math both exp and expf are correctly rounded. expf returns a 32-bit float, which has 24 bits in the significand (one bit is implicit). This is about 7 significant decimal digits – more than enough for us. LLVM expf has two exponent tables: for integer values from -104 to 89, and for fractional values k/128. It also has a degree-4 polynomial approximation for values between -1/256 and 1/256. The argument is split into three components, $x=n+k/128+r$, and the value is computed as a product $\exp(x)=\exp(n)\exp(k/128)\exp(r)$.

Accuracy

When using a lookup table without interpolation, errors are largest between the tabulated points. Since exp(x+step/2) / exp(x) = exp(step/2) ≈ 1 + step/2, the max. relative error of a lookup table is about step/2. For step 0.001 it is about 5e-4.

The original Schraudolph's method has relative errors up to 0.03. Computing exp(x) as exp(x/2)/exp(-x/2), as N. Schraudolph proposed on SO, reduces the errors only by half.

The function expapprox(float) from SIMD-math-prims has relative errors below 1e-5 (I checked all the 32-bit numbers in the range of our interest). Here is how relative errors depend on x:

Benchmarks

Using a modified benchmark from SIMD-math-prims compiled with Clang 17 and glibc on Linux, with -O3, I got the following results for the standard library functions, expapprox from SIMD-math-prims and Schraudolph's 1999 approximation:

std::exp       213M/s
std::expf      302M/s
expapprox     1111M/s
Schraudolph   2247M/s

std::exp[f] functions are not affected by compilation options, although in GCC they are about 2x faster with -ffast-math.
After removing checks for minimum and maximum boundaries of the argument, I got [M/s]:

              -O3       -O3 -mavx   -O3 -mavx2   -O3 -mfma
expapprox     1392        2539        3193         5258   
Schraudolph   3438        6721        6679         8145

(In the example above -mfma -mavx2 gave me numbers slightly lower than -mfma alone.)

With lower-bound checks only (because the exp argument in Gaussian can't be positive):

              -O3       -O3 -mavx   -O3 -mavx2   -O3 -mfma   -O3 -mfma -mavx2
expapprox     1261        2344        2333         3953         4583
Schraudolph   2362        4751        4962         5868         5608

The same (exponential with lower-bound checks) compiled with GCC 12 gave similar results (±10%), except that -mfma -mavx2 was again slower than -mfma alone.

Conclusions

The efficiency of expapprox() depends strongly on the available processor instructions.

The inverse throughput can be even below one cycle, so benchmarking it in isolation doesn't give us a good picture.

We calculate an atom's contribution to a density grid point using 5 Gaussians. When compiling it for a generic x64 processor with 128-bit SIMD registers (4 floats), the program will (optimally) compute 4 Gaussians together and then the 5th one alone. Computing only 4 Gaussians would be more efficient, and with 8 Gaussians we'd get a better approximation at little extra cost. We could consider deriving our own structure factor approximations.

Selection

There are too many approximations of the exponential function to check them all. I picked expapprox() from SIMD-math-prims because it does not use processor-specific intrinsics, can be SIMD-vectorized, has a suitable trade-off between accuracy and efficiency, and the polynomial coefficients were carefully adjusted and discussed (under the blog post, on reddit, in issues). Additionally, if we ever wanted to increase or decrease the order of the polynomial, we could use Sollya to recalculate the coefficients. Other approximations with similar accuracy probably have similar performance (let me know if I'm wrong about this).

Tweaking

expapprox() does type-punning through a union, which is allowed in C, but not in C++. I changed it to memcpy.

We have an inner loop with the exponential function like this:

for (int i = 0; i < N; ++i)  // constexpr int N=5
  density += a[i] * expapprox(b[i] * x);

I think at one point it didn't produce a neatly vectorized code until I moved bound checking to a separate loop:

for (int i = 0; i < N; ++i)
  tmp[i] = std::max(b[i] * r2, -88.f);
for (int i = 0; i < N; ++i)
  density += a[i] * unchecked_expapprox(tmp[i]);

but now I'm not 100% sure it was the case. I'll look into it later.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exponential function for fast calculation of Gaussians #286

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Exponential function for fast calculation of Gaussians #286

wojdyr Nov 1, 2023 Maintainer

Our use case

How functions are approximated

Existing approximations

Accuracy

Benchmarks

Conclusions

Selection

Tweaking

Replies: 0 comments

wojdyr
Nov 1, 2023
Maintainer