
Conversation

@hjanuschka (Collaborator) commented Dec 22, 2025

This PR adds SIMD fast paths to the int_to_float function, which converts custom bit-depth floats stored as i32 back to f32.

32-bit float: straightforward bitcast via SIMD.

16-bit float (f16): SIMD handles normal values, zeros, and inf/nan. Subnormals fall back to scalar since they need a variable-iteration normalization loop.

Waiting for perf CI to see the impact.

Add SIMD fast paths for converting custom bit-depth floats to f32:
- 32-bit float passthrough: Simple bitcast using SIMD
- 16-bit float (f16/half-precision): SIMD conversion with scalar fallback
  for subnormal values

The 16-bit float SIMD path handles normal, zero, and inf/nan cases directly,
falling back to scalar for the rare subnormal case which requires
variable-iteration normalization.

Also adds BitDepth::f16() test helper and comprehensive unit tests for
the conversion functions.
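To illustrate why the scalar fallback described above exists, here is a hypothetical minimal f16-bits-to-f32 conversion (a sketch, not the PR's actual code): the zero, normal, and inf/NaN cases are branch-free field remapping, while the subnormal case needs a loop whose iteration count depends on the value.

```rust
/// Hypothetical scalar sketch: decode IEEE 754 half-precision bits into f32.
/// Illustrative only; jxl-rs has its own conversion in util/float16.rs.
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) as u32) << 31;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let mant = (bits & 0x3ff) as u32;
    let out = match (exp, mant) {
        (0, 0) => sign, // signed zero
        (0, m) => {
            // Subnormal: shift the mantissa up until its implicit bit
            // appears, adjusting the exponent once per iteration. The
            // iteration count varies per value, which is what defeats a
            // fixed-length SIMD formulation.
            let mut e: i32 = 127 - 15 + 1;
            let mut m = m;
            while m & 0x400 == 0 {
                m <<= 1;
                e -= 1;
            }
            sign | ((e as u32) << 23) | ((m & 0x3ff) << 13)
        }
        (0x1f, 0) => sign | 0x7f80_0000,                // +/- infinity
        (0x1f, _) => sign | 0x7fc0_0000 | (mant << 13), // NaN (kept quiet)
        _ => sign | ((exp + 127 - 15) << 23) | (mant << 13), // normal value
    };
    f32::from_bits(out)
}
```

A SIMD implementation can evaluate the zero/normal/inf-NaN arms with lane masks; only lanes that hit the `(0, m)` arm force the scalar fallback.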
github-actions bot commented Dec 22, 2025

Benchmark @ 85ee297

MULTI-FILE BENCHMARK RESULTS (4 files)
  CPU architecture: x86_64
  WARNING: System appears noisy: high system load (2.66). Results may be unreliable.
Statistics:
  Confidence:               99.0%
  Max relative error:        3.0%

Comparing: 352a1543 (Base) vs a1817c3d (PR)

  File                         Base (MP/s)   PR (MP/s)   Δ%
  bike.jxl                          23.839      23.241    -2.51% ±1.8%
  green_queen_modular_e3.jxl         8.093       6.372   -21.26% ±0.9%
  green_queen_vardct_e3.jxl         20.070      19.593    -2.38% ±0.9%
  sunset_logo.jxl                    2.244       2.303    +2.63% ±1.7%

Address veluca93 review: add load_f16_bits() and store_f16() methods
to F32SimdVec trait instead of implementing conversion in convert.rs.

- AVX2+F16C: Hardware _mm256_cvtph_ps/_mm256_cvtps_ph
- AVX-512: Hardware _mm512_cvtph_ps/_mm512_cvtps_ph
- SSE4.2/NEON/Scalar: Scalar fallback

Simplifies convert.rs by ~100 lines.
fn load_f16_bits(d: Self::Descriptor, mem: &[u16]) -> Self {
    assert!(mem.len() >= Self::LEN);
    // Check for F16C at runtime and use hardware conversion if available
    if is_x86_feature_detected!("f16c") {
veluca93 (Member) commented:

That's not a good idea. Given that f16c is as common as avx2 (if not more), let's just always require f16c for the AVX2 path.
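A dispatch shape matching that suggestion might look like the following hypothetical sketch (not jxl-rs code): detect both features once when selecting the implementation, and compile the AVX2 body with `#[target_feature(enable = "avx2,f16c")]` so it can use `_mm256_cvtph_ps` unconditionally, with no per-call check.

```rust
/// Hypothetical one-time dispatch check: the AVX2 path is selected only if
/// f16c is also present, removing the per-call runtime check shown above.
fn has_avx2_f16c() -> bool {
    // Feature detection macros only compile on x86 targets, so gate by arch.
    #[cfg(target_arch = "x86_64")]
    return std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("f16c");
    #[cfg(not(target_arch = "x86_64"))]
    return false;
}
```

Since F16C shipped alongside or before AVX2 on essentially all CPUs, this gate rejects virtually no hardware that the plain AVX2 check would accept.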


fn store_f16(this: F32VecNeon, dest: &mut [u16]) {
    assert!(dest.len() >= F32VecNeon::LEN);
    // TODO: Use vcvt_f16_f32 once Rust stdarch fix lands
veluca93 (Member) commented:

I think at this point I would just use inline ASM here, but we can do that as a follow-up.

unsafe fn load_f16_impl(d: Avx512Descriptor, mem: &[u16]) -> F32VecAvx512 {
    // SAFETY: mem.len() >= 16 is checked by caller, and avx512f is available
    unsafe {
        let bits = _mm256_loadu_si256(mem.as_ptr() as *const __m256i);
veluca93 (Member) commented:

Only the loadu needs to be in an unsafe block.
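The narrowing being asked for can be sketched generically (hypothetical function, not the PR's code): only the operation with a safety contract goes inside the `unsafe` block, and the surrounding bookkeeping stays in safe code, so the compiler keeps checking as much as possible.

```rust
/// Hypothetical example: only the raw pointer read has a safety contract,
/// so it alone sits inside the unsafe block.
fn first_lane(mem: &[u16]) -> u16 {
    assert!(!mem.is_empty()); // safe precondition check
    let ptr = mem.as_ptr(); // safe: taking a raw pointer needs no unsafe
    // SAFETY: the assert above guarantees at least one readable element.
    unsafe { ptr.read() }
}
```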

unsafe {
    // _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC = 0
    let bits = _mm512_cvtps_ph::<0>(v);
    _mm256_storeu_si256(dest.as_mut_ptr() as *mut __m256i, bits);
veluca93 (Member) commented:

Similarly, only the store needs to be in an unsafe block.

// AVX512 implies F16C, so we can always use hardware conversion
#[target_feature(enable = "avx512f")]
#[inline]
unsafe fn load_f16_impl(d: Avx512Descriptor, mem: &[u16]) -> F32VecAvx512 {
veluca93 (Member) commented:

This function does not need to be unsafe if we move the assert inside.
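That refactor, sketched with hypothetical names (not jxl-rs code): moving the length assert inside means the function upholds its own precondition, so the `unsafe` qualifier can come off the signature and callers no longer need unsafe blocks.

```rust
/// Hypothetical safe wrapper: the length precondition is checked here
/// rather than delegated to callers, so the signature is not `unsafe`.
fn load4(mem: &[u16]) -> [u16; 4] {
    assert!(mem.len() >= 4);
    // SAFETY: the assert guarantees 4 readable elements, and a &[u16]
    // is always sufficiently aligned for [u16; 4].
    unsafe { *(mem.as_ptr() as *const [u16; 4]) }
}
```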

// SAFETY: dest.len() >= 16 is checked by caller, and avx512f is available
unsafe {
    // _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC = 0
    let bits = _mm512_cvtps_ph::<0>(v);
veluca93 (Member) commented:

Let's please use ::<{_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC}>.


use super::{F32SimdVec, I32SimdVec, SimdDescriptor, SimdMask};

/// Convert f16 bits (as u16) to f32.
veluca93 (Member) commented:

There's already https://github.com/libjxl/jxl-rs/blob/main/jxl/src/util/float16.rs that has conversion code.

I think we should use that type and code (perhaps by moving the code to the jxl_simd crate), instead of using u16.
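The suggested API shape might look like this hypothetical sketch (illustrative names only; the real conversion code lives in jxl/src/util/float16.rs):

```rust
/// Hypothetical transparent wrapper over half-precision bit patterns,
/// standing in for the existing float16 type the comment points to.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(transparent)]
struct F16(u16);

impl F16 {
    fn from_bits(bits: u16) -> Self {
        F16(bits)
    }
    fn to_bits(self) -> u16 {
        self.0
    }
}

// The SIMD trait methods would then take typed slices instead of raw u16:
// fn load_f16_bits(d: Self::Descriptor, mem: &[F16]) -> Self;
// fn store_f16(this: Self, dest: &mut [F16]);
```

`#[repr(transparent)]` keeps the layout identical to `u16`, so existing bit-pattern buffers can be reinterpreted without copying.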
