
Conversation

RunDevelopment
Collaborator

As I said in #75, more SIMD-friendly approximations for expensive functions could improve performance. So in this PR, I did exactly that.

  • srgb_to_linear now uses a 4th degree polynomial. This is faster than the rational approximation I used before, and has the nice property that its output range is the same as the output range of the real function, so no clamping is necessary to get the values between 0 and 1.

  • cbrt now uses a Halley iteration. Frankly, I don't even know what it did before, but one iteration of Halley's method gives enough accuracy while being significantly cheaper.

    I also replaced the / 3 with a faster multiply+shift version, using the fact that we only compute the cube root for positive numbers. For negative numbers, the result will be slightly wrong, but even that shouldn't be a problem, since the / 3 is only used for the initial guess. (A rough sketch of the new cbrt path is shown below.)

These 2 changes together bring down encoding time by 12% on my machine. A nice win.
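For the curious, here is a rough sketch of what the new cbrt path looks like. This is an illustration of the technique only, not the exact code in this PR; the particular bit trick and the fdlibm-style magic constant are assumptions.

fn cbrt_fast(x: f32) -> f32 {
    // Initial guess via the classic bit hack: divide the raw bits by 3 and add
    // a magic constant (709_958_130 is fdlibm's constant for cbrtf). The
    // division is done as a multiply+shift, which gives i / 3 exactly for any
    // u32; x is assumed to be non-negative here.
    let i = x.to_bits();
    let i_div_3 = ((i as u64 * 0xAAAA_AAAB) >> 33) as u32;
    let t = f32::from_bits(i_div_3.wrapping_add(709_958_130));

    // One Halley iteration for t^3 = x:  t <- t * (t^3 + 2x) / (2t^3 + x)
    let s = t * t * t;
    t * (s + 2.0 * x) / (2.0 * s + x)
}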

Notes:

  • Using a 3rd degree polynomial for srgb_to_linear caused PSNR to drop noticeably.
  • I could not find a better approximation for linear_to_srgb. I even made custom tooling to hopefully find a better rational approximation, but I just couldn't beat it. At the very least, I am reasonably confident that there aren't any rational approximations that can beat it.

Contributor

@awxkee left a comment


Don't take comments too seriously.

Power function for power 1/2.4, ulp ~350

/// Fused multiply-add `a * b + c`, falling back to a separate multiply and add
/// on targets without a guaranteed cheap FMA instruction.
#[inline(always)]
pub(crate) fn fmlaf(a: f32, b: f32, c: f32) -> f32 {
    #[cfg(any(
        all(
            any(target_arch = "x86", target_arch = "x86_64"),
            target_feature = "fma"
        ),
        target_arch = "aarch64"
    ))]
    {
        f32::mul_add(a, b, c)
    }
    #[cfg(not(any(
        all(
            any(target_arch = "x86", target_arch = "x86_64"),
            target_feature = "fma"
        ),
        target_arch = "aarch64"
    )))]
    {
        a * b + c
    }
}

fn power_1_over_2p4(x: f32) -> f32 {
    // <<FunctionApproximations`
    // ClearAll["Global`*"]
    // f[x_]:=x^(1/2.4)
    // {err,approx}=MiniMaxApproximation[f[x],{x,{0.0031308,1},5,5},WorkingPrecision->120]
    // poly=Numerator[approx][[1]];
    // coeffs=CoefficientList[poly,x];
    // TableForm[Table[Row[{"'",NumberForm[coeffs[[i+1]],{50,50},ExponentFunction->(Null&)],"',"}],{i,0,Length[coeffs]-1}]]
    // poly=Denominator[approx][[1]];
    // coeffs=CoefficientList[poly,x];
    // TableForm[Table[Row[{"'",NumberForm[coeffs[[i+1]],{50,50},ExponentFunction->(Null&)],"',"}],{i,0,Length[coeffs]-1}]]

    // {(0.0289338010459052768667829149000052753343873331610775276582557678250851411234301218465061660352860953295187058758737912162+53.8623763009617827238804917522486374010590522684609844916014721001247959077817057340461311250842991318301379035219977324 x+8461.66353644535730096787657341327383515294183720807166590641979127794213938139114937567679471650463944535822518303170833 x^2+211128.049488272220422036251872130329956832786987819720589556696955655573725068505543545903583186876442851765586421440421 x^3+900313.518840443904876317689759512896411972559937409425974431130627993863616349335036691645384583492243036482496509529713 x^4+492315.307389668369449762240479335126870840793920889759480547361599802163460950922479829770546763836543288961586364795476 x^5)/(1+564.074027620019536981196523902363185281203265251643728891604383764379752662060776624555299302071834364195384627871876281 x+40001.4751754530351018342773752031877898539860304695946652893823712363377779862830093570314799330771927269626977941502951 x^2+481502.165256419970578502904463056896789378927643242864543258820988720081861499820733694542963854146311003501446078552596 x^3+934232.440283741344491840394012151287524709719415465841614934991382211082112980461022109810287512020640854392423290894099 x^4+155984.639070076857760721626042860907787145662041895286887431732348206040358627511992445378871099253349296424713626537253 x^5),-8.40481927933470710194099105449220633672892953521701393192821980736071525155554362532804325719077695781594081505798762639*10^-6}}
    const P: [u32; 6] = [
        0x3ced0694, 0x42577313, 0x460436a7, 0x484e2e03, 0x495bcd98, 0x48f0636a,
    ];
    const Q: [u32; 6] = [
        0x3f800000, 0x440d04bd, 0x471c417a, 0x48eb1bc5, 0x49641587, 0x48185429,
    ];
    #[inline(always)]
    #[allow(clippy::too_many_arguments)]
    pub(crate) fn f_polyeval6(x: f32, a0: f32, a1: f32, a2: f32, a3: f32, a4: f32, a5: f32) -> f32 {
        let x2 = x * x;

        let u0 = fmlaf(x, a5, a4);
        let u1 = fmlaf(x, a3, a2);
        let u2 = fmlaf(x, a1, a0);

        let v0 = fmlaf(x2, u0, u1);

        fmlaf(x2, v0, u2)
    }
    let p = f_polyeval6(
        x,
        f32::from_bits(P[0]),
        f32::from_bits(P[1]),
        f32::from_bits(P[2]),
        f32::from_bits(P[3]),
        f32::from_bits(P[4]),
        f32::from_bits(P[5]),
    );
    let q = f_polyeval6(
        x,
        f32::from_bits(Q[0]),
        f32::from_bits(Q[1]),
        f32::from_bits(Q[2]),
        f32::from_bits(Q[3]),
        f32::from_bits(Q[4]),
        f32::from_bits(Q[5]),
    );
    p / q
}

@RunDevelopment
Collaborator Author

Thanks for the comments @awxkee!

I'll go over them and test everything in the next couple of days. I've just been procrastinating on this PR, because investigating the rounding error that fails CI is annoying...

Just three things:

  1. Firstly, I don't have a lot of experience with numerical approximations. I'm just trying out a bunch of stuff and measuring the results. So I'm very open to ideas.

  2. The required precision is typically on the lower side. srgb_to_linear is about 1e-4, linear_to_srgb is about 1e-3, and cbrt is about 1e-5. More or less. I don't have hard requirements for these functions, because I go by the PSNR of perceptually encoded images instead. If PSNR drops significantly, the precision is too low.

    The only hard requirements I test for are that srgb -> oklab has a max error of 1e-3 and srgb -> oklab -> srgb has a max error of 2.5e-3. Basically, it should be roughly within the rounding error for 8-bit colors (a rough test sketch follows this list).

  3. Thank you for the info about FMA. I know that people are using it for this purpose, but I was worried about it costing extra perf on some platforms. However, if it is equally fast or faster on certain targets, then using it is a no-brainer. I'll have to look into that.
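For context, half an 8-bit quantization step is 0.5/255 ≈ 0.002, so the 2.5e-3 round-trip budget is in that ballpark. Here is a hypothetical sketch of how that requirement could be checked; the conversion functions are placeholders passed in as closures, not this crate's actual API.

fn max_round_trip_error(
    srgb_to_oklab: impl Fn([f32; 3]) -> [f32; 3],
    oklab_to_srgb: impl Fn([f32; 3]) -> [f32; 3],
) -> f32 {
    // Sweep all 8-bit values on the gray axis; a real test would cover all
    // 256^3 sRGB triples (or at least a dense sample of them).
    let mut max_err = 0.0f32;
    for i in 0..=255u32 {
        let s = i as f32 / 255.0;
        let rgb = [s, s, s];
        let back = oklab_to_srgb(srgb_to_oklab(rgb));
        for c in 0..3 {
            max_err = max_err.max((back[c] - rgb[c]).abs());
        }
    }
    max_err
}

A test would then assert something like max_round_trip_error(...) <= 2.5e-3 with the actual conversion functions plugged in.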

@RunDevelopment
Collaborator Author

I spent some time on this and got some nice wins. Perceptual encoding is now 16% faster on my machine.

Details:

  1. Using the optimized polynomial form for srgb_to_linear and linear_to_srgb gives 1~2% each. So about a 3% reduction in encoding time total.

  2. FMA is awesome. 4% encoding time reduction just by compiling for it.

  3. Using one Newton iteration in cbrt results in precision that is too low for my taste. It's not bad on paper, with a max error of 0.00064534 for the cbrt function itself, but errors accumulate.

    Method     Max Error           Encoding time
    1 Newton   0.06191 (0.06197)   10.181 ms
    2 Newton   0.00236 (0.00082)   10.969 ms
    1 Halley   0.00240 (0.00126)   10.555 ms

    Max error here is the max error of srgb -> oklab -> srgb, so a round trip. This is what the encoder actually does, so it accurately represents the error it has to deal with. That's the first number. The second number (in parens) is the max error when using the reference implementation for oklab -> srgb. Only the srgb -> oklab part uses cbrt, so this second max error isn't affected by the quite inaccurate linear_to_srgb approximation.

    This data suggests that 2 Newton iterations are unnecessarily precise. The additional precision is offset by the linear_to_srgb approximation. Meanwhile, 1 Halley iteration seems to yield exactly the right amount of precision.

    1 Newton iteration isn't enough precision for me. Its max error in 8 bit (0.06*255=15) is just too much. Interestingly, this large max error doesn't impact PSNR all that much. That's probably because the mean error is just 0.00319, which is well within quantization error for R5G6B5. My concern is that certain images may contain lots of colors that cause above average error for the oklab conversion.

    So I think that cbrt should keep using one Halley iteration to ensure consistent quality, even for edge cases.

  4. The Newton iteration form you suggested didn't work out. Because it computes 1/a, a = 0 needed to be treated as a special case. This, plus the fact that it needs a lot of multiplications, meant that this form ended up slower than the regular Newton form for 1 and 2 iterations. (For reference, the textbook Newton and Halley steps are sketched below.)
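For reference, these are the textbook update steps behind the rows in the table above, as a sketch of the iteration formulas only (not this PR's exact code). "1 Newton" applies newton_step once to the initial guess, "2 Newton" applies it twice, and "1 Halley" applies halley_step once.

fn newton_step(t: f32, x: f32) -> f32 {
    // Newton's method for t^3 = x:  t <- t - (t^3 - x) / (3t^2)
    t - (t * t * t - x) / (3.0 * t * t)
}

fn halley_step(t: f32, x: f32) -> f32 {
    // Halley's method for t^3 = x:  t <- t * (t^3 + 2x) / (2t^3 + x)
    let s = t * t * t;
    t * (s + 2.0 * x) / (2.0 * s + x)
}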

Power function for power 1/2.4, ulp ~350

Sorry, when I said "I am reasonably confident that there aren't any rational approximations that can beat it," I meant rational approximations of the same degrees. The current 3,3 rational approximation is the result of hours of me fine-tuning coefficients by hand (I used MATLAB for the initial coefficients). So I'm having trouble finding anything better now.

@awxkee
Contributor

awxkee commented Sep 29, 2025

Numerical approximation is really a bit more about thousands of tests than about settled knowledge. :)

Nice to see FMA works! By the way, have you tried wrapping your Halley’s iteration in FMA like so:

let s = t * t * t;
t * fma(2.0, x, s) / fma(2.0, s, x)

On most CPUs that support it nowadays, FMA has the same latency as a single float/double multiplication, and CPUs can issue 2 FMAs at once (with some exceptions). So using it where it is available, as long as you're not blocking the whole pipeline, is a good idea in most cases.

O::linear_to_srgb(rgb).clamp(Vec3A::ZERO, Vec3A::ONE)

Clamp in Rust has not been in the best shape for some time; if performance is a concern, try to avoid it. rust-lang/rust#125738

@RunDevelopment
Collaborator Author

By the way, have you tried wrapping your Halley’s iteration in FMA like so:

Oops. I must have undone that while trying things. Easy fix.

Clamp in Rust has not been in the best condition for a some time, if performance is a concern try to avoid this

That's not Ord::clamp, it's Vec3A::clamp which should compile into 2 instructions.

Also, it seems like rust-lang/rust#125738 could be closed soon. I checked out the compiler explorer links and on rustc beta, they all produce the exact same assembly. Probably because they updated to LLVM 21.

@RunDevelopment
Collaborator Author

FMA is so nice. Using it for the Halley iteration got it down to 10.278 ms on my machine. A 17% improvement.

@awxkee
Contributor

awxkee commented Sep 30, 2025

Yeah, it's kind of a sad fact that Rust doesn't offer any kind of "safe" runtime dispatch. So, assuming most users just build generic binaries for x86_64, they won't benefit from FMA. But at least everyone on aarch64, as well as those building target-dependent binaries, will.

@RunDevelopment
Collaborator Author

There is the multiversion crate, but that one doesn't seem to work well across function boundaries.

@awxkee
Contributor

awxkee commented Oct 1, 2025

Rust at the moment doesn't allow target features to propagate between different function contexts, but methods may be inlined, and when inlining happens, the context of the top-level method is preserved. There was a proposal for context inheritance, but there has been no progress.

If you want to, you could mark all methods as #[inline(always)] (such short methods are better off inlined anyway) and mark the top-level method, into which everything is inlined, as FMA-enabled through multiversion. That should work; it is common practice in runtime dispatch. Occasionally surprising things happen when Rust/LLVM refuses to inline a method, but this is rare in practice.
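A rough sketch of that pattern, assuming multiversion's targets(...) attribute syntax; the helper function here is just a placeholder:

use multiversion::multiversion;

#[inline(always)]
fn fast_srgb_to_linear(x: f32) -> f32 {
    // Placeholder body; the real approximation goes here. Because this is
    // #[inline(always)], it inherits the target features of whatever function
    // it gets inlined into.
    x * x
}

#[multiversion(targets("x86_64+fma", "aarch64+neon"))]
fn srgb_to_linear_slice(values: &mut [f32]) {
    // The FMA-enabled clone of this function can use FMA inside the inlined
    // helper; the fallback clone compiles it as plain multiplies and adds.
    for v in values.iter_mut() {
        *v = fast_srgb_to_linear(*v);
    }
}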

I personally don't like the multiversion concept; it just wraps methods into a #[target_feature(...)] unsafe fn and calls them. I can do that myself. I'm not sure that any potentially dangerous macro-based crates are really needed.

@RunDevelopment
Collaborator Author

it just wraps methods into a #[target_feature(...)] unsafe fn and calls them. I can do that myself. I'm not sure that any potentially dangerous macro-based crates are really needed.

I don't want this crate to contain any unsafe, so some crate has to handle it for me. This is a self-imposed limitation, yes, but not having to worry about safety is nice.

I'm also not more worried about macro-based crates than about any other crate. Any dependency is worrisome; who knows what their build.rs will look like in their next patch release. I try to minimize dependencies, but some can't be avoided.

multiversion can be avoided by just eating the lost performance, though.

If you want to, you could mark all methods as #[inline(always)] (such short methods are better off inlined anyway) and mark the top-level method, into which everything is inlined, as FMA-enabled through multiversion. That should work.

I tried multi-versioning the fast_* oklab functions and #[inline(always)]-ing all functions they call, but that ended up slowing the encoder down by ~1%. Either I did something wrong, or the dispatch overhead eats the perf gain because the functions are too small.

Multi-versioning large parts of the BC1 encoder might not be a bad idea, but further testing is needed.

For now, I'm just going to make an issue for multi-versioning encoders.

@RunDevelopment RunDevelopment merged commit d41a0cd into image-rs:main Oct 5, 2025
10 checks passed
@RunDevelopment RunDevelopment deleted the oklab-approximation branch October 5, 2025 10:30