
Conversation

RunDevelopment
Collaborator

As I said in #75, more SIMD-friendly approximations for expensive functions could improve performance. So in this PR, I did exactly that.

  • srgb_to_linear now uses a 4th degree polynomial. This is faster than the rational approximation I used before, and has the nice property that its output range is the same as the output range of the real function, so no clamping is necessary to get the values between 0 and 1.

  • cbrt now uses a Halley iteration. Frankly, I don't even know what it did before, but one iteration of Halley's method gives enough accuracy while being significantly cheaper.

    I also replaced the / 3 with a faster multiply+shift version, using the fact that we only compute the cube root for positive numbers. For negative numbers, the result will be slightly wrong, but even that shouldn't be a problem, since the / 3 is only used for the initial guess. (A rough sketch of the new cbrt path is shown below.)

These 2 changes together bring down encoding time by 12% on my machine. A nice win.
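For the curious, here is a rough sketch of what the new cbrt path looks like. This is an illustration of the technique only, not the exact code in this PR; the particular bit trick and the fdlibm-style magic constant are assumptions.

fn cbrt_fast(x: f32) -> f32 {
    // Initial guess via the classic bit hack: divide the raw bits by 3 and add
    // a magic constant (709_958_130 is fdlibm's constant for cbrtf). The
    // division is done as a multiply+shift, which gives i / 3 exactly for any
    // u32; x is assumed to be non-negative here.
    let i = x.to_bits();
    let i_div_3 = ((i as u64 * 0xAAAA_AAAB) >> 33) as u32;
    let t = f32::from_bits(i_div_3.wrapping_add(709_958_130));

    // One Halley iteration for t^3 = x:  t <- t * (t^3 + 2x) / (2t^3 + x)
    let s = t * t * t;
    t * (s + 2.0 * x) / (2.0 * s + x)
}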

Notes:

  • Using a 3rd degree polynomial for srgb_to_linear caused PSNR to drop noticeably.
  • I could not find a better approximation for linear_to_srgb. I even made custom tooling to hopefully find a better rational approximation, but I just couldn't beat it. At the very least, I am reasonably confident that there aren't any rational approximations that can beat it.

Contributor

@awxkee left a comment


Don't take comments too seriously.

Power function for power 1/2.4, ulp ~350

/// Fused multiply-add `a * b + c`, falling back to a separate multiply and add
/// on targets without a guaranteed cheap FMA instruction.
#[inline(always)]
pub(crate) fn fmlaf(a: f32, b: f32, c: f32) -> f32 {
    #[cfg(any(
        all(
            any(target_arch = "x86", target_arch = "x86_64"),
            target_feature = "fma"
        ),
        target_arch = "aarch64"
    ))]
    {
        f32::mul_add(a, b, c)
    }
    #[cfg(not(any(
        all(
            any(target_arch = "x86", target_arch = "x86_64"),
            target_feature = "fma"
        ),
        target_arch = "aarch64"
    )))]
    {
        a * b + c
    }
}

fn power_1_over_2p4(x: f32) -> f32 {
    // <<FunctionApproximations`
    // ClearAll["Global`*"]
    // f[x_]:=x^(1/2.4)
    // {err,approx}=MiniMaxApproximation[f[x],{x,{0.0031308,1},5,5},WorkingPrecision->120]
    // poly=Numerator[approx][[1]];
    // coeffs=CoefficientList[poly,x];
    // TableForm[Table[Row[{"'",NumberForm[coeffs[[i+1]],{50,50},ExponentFunction->(Null&)],"',"}],{i,0,Length[coeffs]-1}]]
    // poly=Denominator[approx][[1]];
    // coeffs=CoefficientList[poly,x];
    // TableForm[Table[Row[{"'",NumberForm[coeffs[[i+1]],{50,50},ExponentFunction->(Null&)],"',"}],{i,0,Length[coeffs]-1}]]

    // {(0.0289338010459052768667829149000052753343873331610775276582557678250851411234301218465061660352860953295187058758737912162+53.8623763009617827238804917522486374010590522684609844916014721001247959077817057340461311250842991318301379035219977324 x+8461.66353644535730096787657341327383515294183720807166590641979127794213938139114937567679471650463944535822518303170833 x^2+211128.049488272220422036251872130329956832786987819720589556696955655573725068505543545903583186876442851765586421440421 x^3+900313.518840443904876317689759512896411972559937409425974431130627993863616349335036691645384583492243036482496509529713 x^4+492315.307389668369449762240479335126870840793920889759480547361599802163460950922479829770546763836543288961586364795476 x^5)/(1+564.074027620019536981196523902363185281203265251643728891604383764379752662060776624555299302071834364195384627871876281 x+40001.4751754530351018342773752031877898539860304695946652893823712363377779862830093570314799330771927269626977941502951 x^2+481502.165256419970578502904463056896789378927643242864543258820988720081861499820733694542963854146311003501446078552596 x^3+934232.440283741344491840394012151287524709719415465841614934991382211082112980461022109810287512020640854392423290894099 x^4+155984.639070076857760721626042860907787145662041895286887431732348206040358627511992445378871099253349296424713626537253 x^5),-8.40481927933470710194099105449220633672892953521701393192821980736071525155554362532804325719077695781594081505798762639*10^-6}}
    const P: [u32; 6] = [
        0x3ced0694, 0x42577313, 0x460436a7, 0x484e2e03, 0x495bcd98, 0x48f0636a,
    ];
    const Q: [u32; 6] = [
        0x3f800000, 0x440d04bd, 0x471c417a, 0x48eb1bc5, 0x49641587, 0x48185429,
    ];
    #[inline(always)]
    #[allow(clippy::too_many_arguments)]
    pub(crate) fn f_polyeval6(x: f32, a0: f32, a1: f32, a2: f32, a3: f32, a4: f32, a5: f32) -> f32 {
        let x2 = x * x;

        let u0 = fmlaf(x, a5, a4);
        let u1 = fmlaf(x, a3, a2);
        let u2 = fmlaf(x, a1, a0);

        let v0 = fmlaf(x2, u0, u1);

        fmlaf(x2, v0, u2)
    }
    let p = f_polyeval6(
        x,
        f32::from_bits(P[0]),
        f32::from_bits(P[1]),
        f32::from_bits(P[2]),
        f32::from_bits(P[3]),
        f32::from_bits(P[4]),
        f32::from_bits(P[5]),
    );
    let q = f_polyeval6(
        x,
        f32::from_bits(Q[0]),
        f32::from_bits(Q[1]),
        f32::from_bits(Q[2]),
        f32::from_bits(Q[3]),
        f32::from_bits(Q[4]),
        f32::from_bits(Q[5]),
    );
    p / q
}

@RunDevelopment
Collaborator Author

Thanks for the comments @awxkee!

I'll go over them and test everything in the next couple of days. I've just been procrastinating on this PR, because investigating the rounding error that fails CI is annoying...

Just three things:

  1. Firstly, I don't have a lot of experience with numerical approximations. I'm just trying out a bunch of stuff and measuring the results. So I'm very open to ideas.

  2. The required precision is typically on the lower side. srgb_to_linear is about 1e-4, linear_to_srgb is about 1e-3, and cbrt is about 1e-5. More or less. I don't have hard requirements for these functions, because I go by the PSNR of perceptually encoded images instead. If PSNR drops significantly, the precision is too low.

    The only hard requirements I test for are that srgb -> oklab has a max error of 1e-3 and srgb -> oklab -> srgb has a max error of 2.5e-3. Basically, it should be roughly within the rounding error for 8-bit colors (a rough test sketch follows this list).

  3. Thank you for the info about FMA. I know that people are using it for this purpose, but I was worried about it costing extra perf on some platforms. However, if it is equally fast or faster on certain targets, then using it is a no-brainer. I'll have to look into that.
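For context, half an 8-bit quantization step is 0.5/255 ≈ 0.002, so the 2.5e-3 round-trip budget is in that ballpark. Here is a hypothetical sketch of how that requirement could be checked; the conversion functions are placeholders passed in as closures, not this crate's actual API.

fn max_round_trip_error(
    srgb_to_oklab: impl Fn([f32; 3]) -> [f32; 3],
    oklab_to_srgb: impl Fn([f32; 3]) -> [f32; 3],
) -> f32 {
    // Sweep all 8-bit values on the gray axis; a real test would cover all
    // 256^3 sRGB triples (or at least a dense sample of them).
    let mut max_err = 0.0f32;
    for i in 0..=255u32 {
        let s = i as f32 / 255.0;
        let rgb = [s, s, s];
        let back = oklab_to_srgb(srgb_to_oklab(rgb));
        for c in 0..3 {
            max_err = max_err.max((back[c] - rgb[c]).abs());
        }
    }
    max_err
}

A test would then assert something like max_round_trip_error(...) <= 2.5e-3 with the actual conversion functions plugged in.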

@RunDevelopment
Collaborator Author

I spent some time on this and got some nice wins. Perceptual encoding is now 16% faster on my machine.

Details:

  1. Using the optimized polynomial form for srgb_to_linear and linear_to_srgb gives 1~2% each. So about a 3% reduction in encoding time total.

  2. FMA is awesome. 4% encoding time reduction just by compiling for it.

  3. Using one Newton iteration in cbrt results in precision that is too low for my taste. It's not bad on paper, with a max error of 0.00064534 for the cbrt function itself, but errors accumulate.

    Method     Max Error           Encoding time
    1 Newton   0.06191 (0.06197)   10.181 ms
    2 Newton   0.00236 (0.00082)   10.969 ms
    1 Halley   0.00240 (0.00126)   10.555 ms

    Max error here is the max error of srgb -> oklab -> srgb, so a round trip. This is what the encoder actually does, so it accurately represents the error it has to deal with. That's the first number. The second number (in parens) is the max error when using the reference implementation for oklab -> srgb. Only the srgb -> oklab part uses cbrt, so this second max error isn't affected by the quite inaccurate linear_to_srgb approximation.

    This data suggests that 2 Newton iterations are unnecessarily precise. The additional precision is offset by the linear_to_srgb approximation. Meanwhile, 1 Halley iteration seems to yield exactly the right amount of precision.

    1 Newton iteration isn't enough precision for me. Its max error in 8 bit (0.06*255=15) is just too much. Interestingly, this large max error doesn't impact PSNR all that much. That's probably because the mean error is just 0.00319, which is well within quantization error for R5G6B5. My concern is that certain images may contain lots of colors that cause above average error for the oklab conversion.

    So I think that cbrt should keep using one Halley iteration to ensure consistent quality, even for edge cases.

  4. The Newton iteration form you suggested didn't work out. Because it computes 1/a, a = 0 needed to be treated as a special case. This, plus the fact that it needs a lot of multiplications, meant that this form ended up slower than the regular Newton form for 1 and 2 iterations. (For reference, the textbook Newton and Halley steps are sketched below.)
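For reference, these are the textbook update steps behind the rows in the table above, as a sketch of the iteration formulas only (not this PR's exact code). "1 Newton" applies newton_step once to the initial guess, "2 Newton" applies it twice, and "1 Halley" applies halley_step once.

fn newton_step(t: f32, x: f32) -> f32 {
    // Newton's method for t^3 = x:  t <- t - (t^3 - x) / (3t^2)
    t - (t * t * t - x) / (3.0 * t * t)
}

fn halley_step(t: f32, x: f32) -> f32 {
    // Halley's method for t^3 = x:  t <- t * (t^3 + 2x) / (2t^3 + x)
    let s = t * t * t;
    t * (s + 2.0 * x) / (2.0 * s + x)
}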

Power function for power 1/2.4, ulp ~350

Sorry, when I said "I am reasonably confident that there aren't any rational approximations that can beat it," I meant rational approximations of the same degrees. The current 3,3 rational approximation is the result of hours of me fine-tuning coefficients by hand (I used MATLAB for the initial coefficients). So I'm having trouble finding anything better now.

@awxkee
Contributor

awxkee commented Sep 29, 2025

Numerical approximation is really a bit more about thousands of tests than about settled knowledge. :)

Nice to see FMA works! By the way, have you tried wrapping your Halley’s iteration in FMA like so:

let s = t * t * t;
t * fma(2.0, x, s) / fma(2.0, s, x)

On most CPUs that support it nowadays, FMA has the same latency as a single float/double multiplication, and CPUs can issue 2 FMAs at once (with some exceptions). So using it where it is available, as long as you're not blocking the whole pipeline, is a good idea in most cases.

O::linear_to_srgb(rgb).clamp(Vec3A::ZERO, Vec3A::ONE)

Clamp in Rust has not been in the best shape for some time; if performance is a concern, try to avoid it. rust-lang/rust#125738

@RunDevelopment
Collaborator Author

By the way, have you tried wrapping your Halley’s iteration in FMA like so:

Oops. I must have undone that while trying things. Easy fix.

Clamp in Rust has not been in the best condition for a some time, if performance is a concern try to avoid this

That's not Ord::clamp, it's Vec3A::clamp which should compile into 2 instructions.

Also, it seems like rust-lang/rust#125738 could be closed soon. I checked out the compiler explorer links and on rustc beta, they all produce the exact same assembly. Probably because they updated to LLVM 21.

@RunDevelopment
Collaborator Author

FMA is so nice. Using it for the Halley iteration got it down to 10.278 ms on my machine. A 17% improvement.

@awxkee
Contributor

awxkee commented Sep 30, 2025

Yeah, it's kind of a sad fact that Rust doesn't offer any kind of "safe" runtime dispatch. So, assuming most users just build generic binaries for x86_64, they won't benefit from FMA. But at least everyone on aarch64, as well as those building target-dependent binaries, will.

@RunDevelopment
Collaborator Author

There is the multiversion crate, but that one doesn't seem to work well across function boundaries.

@awxkee
Contributor

awxkee commented Oct 1, 2025

Rust at the moment doesn't allow target features to propagate between different function contexts, but methods may be inlined, and when inlining happens, the context of the top-level method is preserved. There was a proposal for context inheritance, but there has been no progress.

If you want to, you could mark all methods as #[inline(always)] (such short methods are better off inlined anyway) and mark the top-level method, into which everything is inlined, as FMA-enabled through multiversion. That should work; it is common practice in runtime dispatch. Occasionally surprising things happen when Rust/LLVM refuses to inline a method, but this is rare in practice.
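A rough sketch of that pattern, assuming multiversion's targets(...) attribute syntax; the helper function here is just a placeholder:

use multiversion::multiversion;

#[inline(always)]
fn fast_srgb_to_linear(x: f32) -> f32 {
    // Placeholder body; the real approximation goes here. Because this is
    // #[inline(always)], it inherits the target features of whatever function
    // it gets inlined into.
    x * x
}

#[multiversion(targets("x86_64+fma", "aarch64+neon"))]
fn srgb_to_linear_slice(values: &mut [f32]) {
    // The FMA-enabled clone of this function can use FMA inside the inlined
    // helper; the fallback clone compiles it as plain multiplies and adds.
    for v in values.iter_mut() {
        *v = fast_srgb_to_linear(*v);
    }
}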

I personally don't like the multiversion concept; it just wraps methods into a #[target_feature(...)] unsafe fn and calls them. I can do that myself. I'm not sure that any potentially dangerous macro-based crates are really needed.

@RunDevelopment
Collaborator Author

it just wraps methods into a #[target_feature(...)] unsafe fn and calls them. I can do that myself. I'm not sure that any potentially dangerous macro-based crates are really needed.

I don't want this crate to contain any unsafe, so some crate has to handle it for me. This is a self-imposed limitation, yes, but not having to worry about safety is nice.

I'm also not more worried about macro-based crates than about any other crate. Any dependency is worrisome; who knows what their build.rs will look like in their next patch release. I try to minimize dependencies, but some can't be avoided.

multiversion can be avoided by just eating the lost performance, though.

If you want to, you could mark all methods as #[inline(always)] (such short methods are better off inlined anyway) and mark the top-level method, into which everything is inlined, as FMA-enabled through multiversion. That should work.

I tried multi-versioning the fast_* oklab functions and #[inline(always)]-ing all functions they call, but that ended up slowing the encoder down by ~1%. Either I did something wrong, or the dispatch overhead eats the perf gain because the functions are too small.

Multi-versioning large parts of the BC1 encoder might not be a bad idea, but further testing is needed.

For now, I'm just going to make an issue for multi-versioning encoders.

@RunDevelopment RunDevelopment merged commit d41a0cd into image-rs:main Oct 5, 2025
10 checks passed
@RunDevelopment RunDevelopment deleted the oklab-approximation branch October 5, 2025 10:30