Faster approximation for Oklab conversions #76
Conversation
Don't take comments too seriously.
Power function for power 1/2.4, ~350 ulp:

#[inline(always)]
pub(crate) fn fmlaf(a: f32, b: f32, c: f32) -> f32 {
    // Use a fused multiply-add where it is available at compile time,
    // otherwise fall back to a separate multiply and add.
    #[cfg(any(
        all(
            any(target_arch = "x86", target_arch = "x86_64"),
            target_feature = "fma"
        ),
        target_arch = "aarch64"
    ))]
    {
        f32::mul_add(a, b, c)
    }
    #[cfg(not(any(
        all(
            any(target_arch = "x86", target_arch = "x86_64"),
            target_feature = "fma"
        ),
        target_arch = "aarch64"
    )))]
    {
        a * b + c
    }
}

/// Evaluates the degree-5 polynomial a0 + a1*x + ... + a5*x^5,
/// pairing terms so the FMAs can overlap (Estrin-style).
#[inline(always)]
#[allow(clippy::too_many_arguments)]
pub(crate) fn f_polyeval6(x: f32, a0: f32, a1: f32, a2: f32, a3: f32, a4: f32, a5: f32) -> f32 {
    let x2 = x * x;
    let u0 = fmlaf(x, a5, a4);
    let u1 = fmlaf(x, a3, a2);
    let u2 = fmlaf(x, a1, a0);
    let v0 = fmlaf(x2, u0, u1);
    fmlaf(x2, v0, u2)
}
fn power_1_over_2p4(x: f32) -> f32 {
// <<FunctionApproximations`
// ClearAll["Global`*"]
// f[x_]:=x^(1/2.4)
// {err,approx}=MiniMaxApproximation[f[x],{x,{0.0031308,1},5,5},WorkingPrecision->120]
// poly=Numerator[approx][[1]];
// coeffs=CoefficientList[poly,x];
// TableForm[Table[Row[{"'",NumberForm[coeffs[[i+1]],{50,50},ExponentFunction->(Null&)],"',"}],{i,0,Length[coeffs]-1}]]
// poly=Denominator[approx][[1]];
// coeffs=CoefficientList[poly,x];
// TableForm[Table[Row[{"'",NumberForm[coeffs[[i+1]],{50,50},ExponentFunction->(Null&)],"',"}],{i,0,Length[coeffs]-1}]]
// {(0.0289338010459052768667829149000052753343873331610775276582557678250851411234301218465061660352860953295187058758737912162+53.8623763009617827238804917522486374010590522684609844916014721001247959077817057340461311250842991318301379035219977324 x+8461.66353644535730096787657341327383515294183720807166590641979127794213938139114937567679471650463944535822518303170833 x^2+211128.049488272220422036251872130329956832786987819720589556696955655573725068505543545903583186876442851765586421440421 x^3+900313.518840443904876317689759512896411972559937409425974431130627993863616349335036691645384583492243036482496509529713 x^4+492315.307389668369449762240479335126870840793920889759480547361599802163460950922479829770546763836543288961586364795476 x^5)/(1+564.074027620019536981196523902363185281203265251643728891604383764379752662060776624555299302071834364195384627871876281 x+40001.4751754530351018342773752031877898539860304695946652893823712363377779862830093570314799330771927269626977941502951 x^2+481502.165256419970578502904463056896789378927643242864543258820988720081861499820733694542963854146311003501446078552596 x^3+934232.440283741344491840394012151287524709719415465841614934991382211082112980461022109810287512020640854392423290894099 x^4+155984.639070076857760721626042860907787145662041895286887431732348206040358627511992445378871099253349296424713626537253 x^5),-8.40481927933470710194099105449220633672892953521701393192821980736071525155554362532804325719077695781594081505798762639*10^-6}}
    const P: [u32; 6] = [
        0x3ced0694, 0x42577313, 0x460436a7, 0x484e2e03, 0x495bcd98, 0x48f0636a,
    ];
    const Q: [u32; 6] = [
        0x3f800000, 0x440d04bd, 0x471c417a, 0x48eb1bc5, 0x49641587, 0x48185429,
    ];
    // Evaluate the (5,5) rational approximation P(x) / Q(x).
    let p = f_polyeval6(
        x,
        f32::from_bits(P[0]),
        f32::from_bits(P[1]),
        f32::from_bits(P[2]),
        f32::from_bits(P[3]),
        f32::from_bits(P[4]),
        f32::from_bits(P[5]),
    );
    let q = f_polyeval6(
        x,
        f32::from_bits(Q[0]),
        f32::from_bits(Q[1]),
        f32::from_bits(Q[2]),
        f32::from_bits(Q[3]),
        f32::from_bits(Q[4]),
        f32::from_bits(Q[5]),
    );
    p / q
}
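As a quick sanity check of the ~350 ulp figure, something like the following could be run against the snippet above. This is only a sketch: it assumes the approximation's stated domain [0.0031308, 1], uniform sampling, and f64 powf as the reference.

fn ulp_distance(a: f32, b: f32) -> u32 {
    // Both values are positive finite floats here, so the difference of the
    // raw bit patterns is the distance in ulps.
    (a.to_bits() as i64 - b.to_bits() as i64).unsigned_abs() as u32
}

fn main() {
    let n = 1_000_000;
    let mut max_ulp = 0;
    for i in 0..=n {
        let x = 0.0031308_f32 + (1.0 - 0.0031308) * (i as f32 / n as f32);
        let approx = power_1_over_2p4(x);
        let reference = (x as f64).powf(1.0 / 2.4) as f32;
        max_ulp = max_ulp.max(ulp_distance(approx, reference));
    }
    println!("max error: {max_ulp} ulp");
}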
Thanks for the comments @awxkee! I'll go over them and test everything in the next couple of days. I've just been procrastinating on this PR, because investigating the rounding error that fails CI is annoying... Just three things:
I spent some time on this, and there are some nice wins: perceptual encoding is now 16% faster on my machine. Details:
Sorry, when I said "I am reasonably confident that there aren't any rational approximations that can beat it," I meant rational approximations of the same degree. The current 3,3 rational approximation is the result of hours of me fine-tuning coefficients by hand (I used MATLAB for the initial coefficients), so I have trouble finding anything better now.
All numerical approximations are really more a matter of thousands of tests than of settled knowledge. :) Nice to see FMA works! By the way, have you tried wrapping your Halley iteration in FMA, like so?

let s = t * t * t;
t * fma(2.0, x, s) / fma(2.0, s, x)

On most CPUs that support it, FMA has the same latency as a single float/double multiplication, and CPUs can issue 2 FMAs at once (with some exceptions). So using it where it is available, as long as you're not blocking the whole pipeline, is a good idea in most cases.

O::linear_to_srgb(rgb).clamp(Vec3A::ZERO, Vec3A::ONE)

Clamp in Rust has not been in the best shape for some time; if performance is a concern, try to avoid it. rust-lang/rust#125738
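For context, here is that suggestion written out as a standalone step, reusing the fmlaf helper from the earlier review snippet. This is a sketch, not necessarily the code that ends up in the PR.

#[inline(always)]
fn halley_cbrt_step(x: f32, t: f32) -> f32 {
    // One Halley iteration for cbrt(x) with current guess t:
    // t_next = t * (t^3 + 2x) / (2*t^3 + x); both sums fuse where FMA exists.
    let s = t * t * t;
    t * fmlaf(2.0, x, s) / fmlaf(2.0, s, x)
}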
Oops. I must have undone that while trying things. Easy fix.
That's not

Also, it seems like rust-lang/rust#125738 could be closed soon. I checked out the Compiler Explorer links, and on rustc beta they all produce the exact same assembly, probably because they updated to LLVM 21.
FMA is so nice. Using it for the Halley iteration got it down to 10.278 ms on my machine. A 17% improvement.
Yeah, it's kind of a sad fact that Rust doesn't offer any kind of "safe" runtime dispatch. So, assuming most users just build generic binaries for x86_64, they won't benefit from FMA. But at least everyone on aarch64, as well as those building target-specific binaries, will.
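For reference, the usual (unsafe) runtime-dispatch pattern on x86_64 looks roughly like the sketch below. The function names (sum_mul_add, sum_mul_add_fma, sum_squares) are hypothetical and not part of this crate; the point is only that f32::mul_add lowers to a real FMA instruction once it is inlined into a #[target_feature(enable = "fma")] context.

#[inline(always)]
fn sum_mul_add(data: &[f32]) -> f32 {
    // Hot loop written with mul_add. With the fma feature enabled this is a
    // single FMA instruction per element; without it, a libm fallback call.
    data.iter().fold(0.0, |acc, &v| f32::mul_add(v, v, acc))
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "fma")]
unsafe fn sum_mul_add_fma(data: &[f32]) -> f32 {
    // Same code; inlined here it is compiled with FMA available.
    sum_mul_add(data)
}

#[cfg(target_arch = "x86_64")]
pub fn sum_squares(data: &[f32]) -> f32 {
    if is_x86_feature_detected!("fma") {
        // The required feature was just detected at runtime.
        return unsafe { sum_mul_add_fma(data) };
    }
    sum_mul_add(data)
}

#[cfg(not(target_arch = "x86_64"))]
pub fn sum_squares(data: &[f32]) -> f32 {
    sum_mul_add(data)
}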
There is the
Rust at the moment doesn't allow passing target features between different function contexts, but methods may be inlined, and when inlining happens the context of the top-level method is preserved. There was a proposal for context inheritance, but there has been no progress. If you want to, you could mark all methods as

I personally don't like
I don't want this crate to contain any

I am also not more worried about macro-based crates. Any dependency is worrisome. Who knows what their
I tried to multi-version the

Multi-versioning large parts of the BC1 encoder might not be a bad idea, but further testing is needed. For now, I'm just going to make an issue for multi-versioning encoders.
As I said in #75, more SIMD-friendly approximations for expensive functions could improve performance. So in this PR, I did exactly that.

- srgb_to_linear now uses a 4th degree polynomial. This is faster than the rational approximation I used before, and has the nice property that its output range is the same as the output range of the real function, so no clamping is necessary to get the values between 0 and 1.
- cbrt now uses a Halley iteration. Frankly, I don't even know what it did before, but one iteration of Halley's method gives enough accuracy while being significantly cheaper. I also replaced the / 3 with a faster multiply+shift version, using the fact that we only compute the cube root for positive numbers (see the sketch below). For negative numbers, the result will be slightly wrong, but even that shouldn't be a problem, since the / 3 is only used for the initial guess.

These 2 changes together bring down encoding time by 12% on my machine. A nice win.
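For illustration, a minimal sketch of that multiply+shift initial guess. The helper name and the bias constant here are assumptions, not the PR's actual code; it only shows the trick of replacing the integer division by 3 with a reciprocal multiplication.

fn cbrt_initial_guess(x: f32) -> f32 {
    // Classic exponent-based first guess for cbrt, valid for positive finite x.
    let i = x.to_bits();
    // i / 3 computed as (i * ceil(2^33 / 3)) >> 33, exact for every u32.
    let i_div_3 = ((u64::from(i) * 0xAAAA_AAAB) >> 33) as u32;
    // Re-bias the exponent: bits(x) / 3 + (2/3) * bits(1.0) ~ bits(cbrt(x)).
    let bias = (2 * (127u32 << 23)) / 3;
    f32::from_bits(i_div_3 + bias)
}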
Notes:

- srgb_to_linear: caused PSNR to drop noticeably.
- linear_to_srgb: I even made custom tooling to hopefully find a better rational approximation, but I just couldn't beat it. At the very least, I am reasonably confident that there aren't any rational approximations that can beat it.