Rounding modes are the last performance regression in the NVPTX codegen in a set of applications I'm looking at. See: StanfordLegion/legion#1041 (comment)
It appears that NVVM generates instructions like `sub.f32` (no rounding mode) while NVPTX generates `sub.rn.f32` (round-to-nearest). The latter lines up with what LLVM says its default is, so is probably the more "correct". But we're losing about 20% performance due to this, so we need a way to fix it.
The constraints from LLVM appear to be as follows:
- If you want to use anything other than the default, you have to mark the entire function as `strictfp`.
- You must be very careful not to mix `strictfp` and non-`strictfp` code.
- You can still achieve the default behavior in `strictfp`, but you have to do so by explicitly calling every floating point instruction with round-to-nearest mode.
- You also can't use LLVM's default floating point instructions, but have to use some special intrinsics instead.

Basically, you're going to end up ripping up your entire floating point code generation, which seems like a pain.
Codegen is painful, but there is also the question of what the interface should be. We could do this with macros, but that seems like it will quickly get obnoxious (`terralib.fadd(..., "round-to-nearest")`). It might be better to set the rounding mode function-wide (`fn:setroundingmode("round-to-nearest")`). Then we need to figure out the code generation but don't otherwise need to annotate floating point math within a function. You can still get different rounding modes by using different functions, and then use `:setinlined(true)` to combine them back into a single whole at optimization time.
> If no rounding modifier is specified, default is .rn and instructions may be folded into a multiply-add.
This seems to be an odd conflation of two things: the default rounding mode (which is actually the same as LLVM's) and the equivalent of LLVM's `contract` flag. The latter allows fusing multiply-adds, which is why this has a performance impact.
It seems that NVVM always emitted these default-rounding-mode instructions that allowed fusion. That's technically incorrect, since a fused multiply-add rounds once where a separate multiply and add round twice, so it could give bad results. At any rate, I've now manually replicated the results with a hacky version of Terra, so I'm pretty sure that's where the performance is going.
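To illustrate why fusion can change results, here is a minimal sketch in Python. It is not Terra, LLVM, or PTX code: it uses double precision rather than `f32`, and the helper names `separate_multiply_add` and `fused_multiply_add` are made up for this illustration. The fused version is emulated with exact rational arithmetic followed by a single rounding, which is my assumption about how best to show the effect without relying on hardware FMA.

```python
from fractions import Fraction


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again (two roundings)."""
    product = a * b       # rounded to the nearest double
    return product + c    # rounded a second time


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: exact a*b + c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no rounding here
    return float(exact)   # single round-to-nearest conversion


# Hypothetical inputs chosen so that the exact product 1 - 2**-60
# is not representable as a double and rounds up to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = separate_multiply_add(a, b, c)  # 0.0
fused = fused_multiply_add(a, b, c)       # -2**-60

print("separate mul + add:", unfused)
print("fused multiply-add:", fused)
```

With these inputs the separate version loses the low-order bits of the product and returns zero, while the fused version keeps them. Both use round-to-nearest, so the difference comes from contraction, not from the rounding mode.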
In short, I don't think there's any LLVM rounding mode that would lead to the optimizations I need (since the PTX semantics are... strange), so while this may still be a useful feature, it's not responsible for the performance regression in StanfordLegion/legion#1041 and therefore shouldn't hold back #471.