Support rounding modes for floating point math #529

elliottslaughter · 2022-04-05T00:16:35Z

Rounding modes are the last performance regression in the NVPTX codegen in a set of applications I'm looking at. See: StanfordLegion/legion#1041 (comment)

It appears that NVVM generates instructions like sub.f32 (no rounding mode) while NVPTX generates sub.rn.f32 (round-to-nearest). The latter matches what LLVM says its default is, so it is probably the more "correct" of the two. But we're losing about 20% performance due to this, so we need a way to fix it.

The constraints from LLVM appear to be as follows:

If you want to use anything other than the default, you have to mark the entire function as strictfp. You must be very careful not to mix strictfp and non-strictfp code. You can still achieve the default behavior in strictfp, but you have to do so by explicitly requesting round-to-nearest on every floating point operation. You also can't use LLVM's default floating point instructions, but have to use the constrained floating point intrinsics (the llvm.experimental.constrained.* family) instead. Basically, you're going to end up ripping up your entire floating point code generation, which seems like a pain.

Codegen is painful, but there is also the question of what the interface should be. We could do this with macros, but that seems like it will quickly get obnoxious (terralib.fadd(..., "round-to-nearest")). It might be better to set the rounding mode function-wide (fn:setroundingmode("round-to-nearest")). Then we need to figure out the code generation but don't otherwise need to annotate floating point math within a function. You can still get different rounding modes by using different functions, and then :setinlined(true) to get them combined back into a single whole at optimization time.
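As a rough analogy for the function-wide option, here is a minimal sketch in Python rather than Terra. It uses the standard decimal module, which has a scoped rounding mode. The `with_rounding` decorator and the two example functions are made up for illustration and are not an existing or proposed Terra API. The point is only that the mode is attached once per function, and a caller combines functions with different modes.

```python
import decimal
import functools

def with_rounding(mode):
    """Hypothetical helper: run the wrapped function under one rounding mode."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # localcontext() copies the current context and restores it on exit,
            # so the mode does not leak into the caller.
            with decimal.localcontext() as ctx:
                ctx.rounding = mode
                return fn(*args, **kwargs)
        return wrapper
    return decorate

@with_rounding(decimal.ROUND_HALF_EVEN)
def two_thirds_nearest():
    # No per-operation annotation needed inside the body.
    return decimal.Decimal(2) / decimal.Decimal(3)

@with_rounding(decimal.ROUND_FLOOR)
def two_thirds_down():
    return decimal.Decimal(2) / decimal.Decimal(3)

def combined():
    # Different modes are mixed by calling differently annotated functions.
    return two_thirds_nearest() - two_thirds_down()

print(two_thirds_nearest(), two_thirds_down(), combined())
```

The per-operation alternative would instead pass the mode to every arithmetic call, which is the part that gets obnoxious quickly.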

elliottslaughter · 2022-04-05T22:45:22Z

I've been digging some more, and I think this is a bit of a red herring.

The PTX semantics on rounding modes say:

If no rounding modifier is specified, default is .rn and instructions may be folded into a multiply-add.

This seems to be an odd conflation of two things: the default rounding mode (which is actually the same as LLVM's) and the equivalent of LLVM's contract flag. The latter allows fusing a multiply and an add into a single multiply-add, which is why this has a performance impact.
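To make the fusion point concrete, here is a minimal sketch in Python (not Terra or PTX) of why contraction is a separate question from the rounding mode. Both versions round to nearest; the only difference is whether the product is rounded before the add. The helper names are made up for illustration, the fused result is emulated with exact rational arithmetic rather than a hardware FMA, and the values are doubles rather than f32.

```python
from fractions import Fraction

def mul_add_unfused(a, b, c):
    # Two roundings: the product is rounded to a double, then the sum is.
    return a * b + c

def mul_add_fused(a, b, c):
    # One rounding: compute a*b + c exactly, then round to a double once.
    # (Emulates what a fused multiply-add does; not a hardware FMA call.)
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused sum is 0.0. The fused version keeps the -2**-60.
unfused = mul_add_unfused(a, b, c)
fused = mul_add_fused(a, b, c)

print(unfused, fused)
```

So a compiler that fuses without being told it may do so can change results even though the rounding mode never changed.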

It seems that NVVM always emitted these default-rounding-mode instructions that allowed fusion. That's technically incorrect, since LLVM does not permit fusion unless the contract flag is set, and a fused multiply-add rounds once instead of twice, so it can give different results. At any rate, I've now manually replicated the results with a hacky version of Terra so I'm pretty sure that's where the performance is going.

In short, I don't think there's any LLVM rounding mode that would lead to the optimizations I need (since the PTX semantics are... strange), so while this may still be a useful feature, it's not responsible for the performance regression in StanfordLegion/legion#1041 and therefore shouldn't hold back #471.

elliottslaughter mentioned this issue Apr 5, 2022

RFC: Deprecate support for LLVM <= 5 #471

Closed

6 tasks

elliottslaughter mentioned this issue Apr 5, 2022

Support fast math flags #530

Closed
