[Feature Request]: data-profile guided optimization using likely()/unlikely() #26933
Hi Damian — Thanks for capturing this request as an issue. With respect to your questions:
Nope, I think we've got a good grasp on that.
That may be useful, depending on the terms. :) Just looking at your motivating example, it seems as though the main thing that would be required to support it would be an equivalent of C's __builtin_expect() hint.

Meanwhile, the choices of enabling vs. disabling seem like they could be done in a few different ways via config params. For a C back-end supporting it, it looks like one can just emit the corresponding builtin directly. For LLVM, I think we'll need to do a bit more to plumb it through, though it does look as though we'll be able to. Seeing the variations that take a probability, it seems like we could also provide an optional argument to enable specifying probabilities on a case-by-case basis as well (where the default could be set by another config param, or perhaps just indicate that the branch is likely or unlikely without a specific probability).
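For reference, a minimal sketch of the C-level builtins being alluded to here (GCC and Clang); the probability-taking form is the kind of variation mentioned above:

```c
/* The classic likely()/unlikely() idiom built on __builtin_expect():
   the expression still evaluates normally, but the optimizer is told
   which outcome to expect and lays out the branches accordingly. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* GCC 9+ (and recent Clang) also provide a variant that takes an
   explicit probability, which could back an optional probability
   argument at the Chapel level. */
#define likely_with_p(x, p) __builtin_expect_with_probability(!!(x), 1, (p))
```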
Thanks for looking at this Brad. I will put together some examples in C next week so that I can change the isn't in your comment "whether or not this has the desired effect on the generated code isn't immediately obvious to me" ... to is.

For a random problem, let's consider the chance of any 16-bit floating point number appearing as the value of a real(16). The chance that it is a normal number is about 97%, a subnormal about 3%, and an infinity or a zero about 0.006%. The chance of a NaN is arguably about 0.003%, although it can be higher if the programmer is having a bad day, so its statistics are more related to how often the algorithm is dodgy. I could have used the probabilities for real(32) but I cannot do those in my head.

Wrapping some boolean expression in unlikely() tells the compiler that the branching overhead to get to the block of code whose execution that boolean expression mandates is OK by me. Encapsulating it in likely() says "please avoid branching". Looking at a paraphrase of the code in #25088, it says:

```
if an encoding of a magnitude < the least positive normal then
    do THIS // the datum is a subnormal or a zero
else
    do THAT // the datum is a normal or an infinity or a NaN
```

97% of the time I want to do THAT. Only 3% of the time will I want to do THIS. So I prefer that there be no branching before it does THAT.
So, I would tell the compiler that the object of the then was unlikely; well, technically, less likely. So if this feature were available, I could hint to the compiler with:

```
if unlikely(an encoding of a magnitude < the least positive normal) then
    do THIS // the datum is a subnormal or a zero
else
    do THAT // the datum is a normal or an infinity or a NaN
```

And I would expect to get something like that fake assembler above.

As a general rule (and I hate that word), you prefix the much less common cases, such as the processing of infinities or NaNs or zeros, with unlikely() and the dominant case with likely(). As the precision of a floating point number increases, i.e. with a real(32) or real(64), the chance of these non-normal cases becomes even less common, so the performance benefit increases. For the real(32) cases, which I use a lot, the benefit is significant. Sometimes I play around with the ordering of the likely() and unlikely() cases and review the assembler to figure out what is optimal, although that can change marginally with new releases of an optimizing compiler, or between one compiler and another.

I generally run my algorithms for every single real(32) in existence and compare the result against the same algorithm using real(64) arithmetic, so I encounter all these cases. In practical algorithms, I mostly see an even higher percentage of normal cases, so the benefits are even more significant.

I will rework that real(16) code with likely/unlikely in C, and do the same for a complex projection, a real scaling algorithm, and an algorithm to yield the content (f, n) of any floating point number x such that x == 2**n * f where 1.0 <= |f| < 2.0. I will discuss the assembler produced for an X86-64 and hopefully make the benefits obvious. I might look at nextafter() or nextup(), but they are not commonly used routines. I will stick to simple algorithms where the code and assembler are clearer.

While I am sure that there are cases where making the decision based on a precise probability would be optimal, if you have to go that far into looking at probabilities, I think the original programmer needs to go back and look at the underlying algorithm. I only use this technique for routines which yield fewer than 100 or so assembler instructions, although I sometimes play around with stuff in the 100-200 line range to see what happens, just out of interest. I sometimes get a nice surprise but generally find the second case unproductive and certainly less predictable.
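For concreteness, here is a minimal C sketch of the kind of hint described above for the real(16) case; the encoding constants are for IEEE half precision and the THIS/THAT bodies are placeholders:

```c
#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

/* bits is the raw 16-bit encoding of a half-precision value */
void handle_f16(uint16_t bits)
{
    const uint16_t mag          = bits & 0x7fffu;  /* drop the sign bit  */
    const uint16_t least_normal = 0x0400u;         /* encoding of 2**-14 */

    if (unlikely(mag < least_normal))
    {
        /* do THIS: the datum is a subnormal or a zero (the ~3% case) */
    }
    else
    {
        /* do THAT: a normal, an infinity or a NaN (the ~97% case);
           with the hint, this path should be the fall-through one */
    }
}
```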
I will post those examples in full once I figure out how to put an ATO link in these posts. Here is another example with complex arithmetic:

```chapel
// map z to itself unless z has an infinite component, in which
// case it maps to the infinity in the extended complex plane
proc proj(z : complex(?w))
{
    param nan_propagation = true;
    param v = w >> 1;
    type Real = real(v);
    type Uint = uint(v);
    param INF = inf:Real;
    param _N = 1:Uint << (v - 1);
    const re = z.re, ar = abs(re);
    const im = z.im, ai = abs(im);

    inline proc infinity() { return if nan_propagation then ar + ai else INF; }

    // with a bit of luck, the return of z should involve no branch!!
    return if likely(ar != INF && ai != INF) then
        z // > 99.99% of cases are this
    else
        (infinity(), (im.transmute(Uint) & _N).transmute(Real)):complex(w);
}
```
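For comparison, a rough C analogue of the same projection (essentially what the standard cprojf() does), with the finite case marked likely; this is just a sketch, not the ATO code, and it omits the nan_propagation option:

```c
#include <complex.h>
#include <math.h>

#define likely(x) __builtin_expect(!!(x), 1)

/* map z to itself unless a component is infinite, in which case map it
   to the point at infinity, preserving the sign of the imaginary zero */
static inline float _Complex proj_f32(float _Complex z)
{
    const float re = crealf(z);
    const float im = cimagf(z);

    if (likely(fabsf(re) != INFINITY && fabsf(im) != INFINITY))
        return z;                                  /* > 99.99% of cases */

    return CMPLXF(INFINITY, copysignf(0.0f, im));
}
```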
Two more examples:

- Return the content of a floating point number - dominated by finite and normal numbers. ATO
- Scale a real(w) by a power of 2 with minimalist multiplications and no error. ATO

Let's hope I got the ATO links right. Sorry, no unit test cases for those yet, but the code highlights the use of likely() and unlikely(). In C, these are 30 and 43 lines of assembler with GCC 11 and are small enough to have minimal inlining overhead. I would hope the same is true of Chapel.
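As a flavour of the first of those, here is a hedged C sketch of the content-of-a-float idea for real(32); it is not the ATO code, just an illustration of putting likely() on the dominant normal path, with the rare classes falling back to frexpf():

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define likely(x) __builtin_expect(!!(x), 1)

/* decompose x as x == f * 2**n with 1.0 <= |f| < 2.0 for normal inputs;
   zeros, subnormals, infinities and NaNs take the rare slow path */
static void content_f32(float x, float *f, int *n)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    const uint32_t mag = u & 0x7fffffffu;

    if (likely(mag >= 0x00800000u && mag < 0x7f800000u))   /* normal */
    {
        *n = (int)(mag >> 23) - 127;                       /* unbiased exponent   */
        u = (u & 0x807fffffu) | (127u << 23);              /* force exponent of 0 */
        memcpy(f, &u, sizeof *f);
        return;
    }

    *f = 2.0f * frexpf(x, n);   /* frexp yields 0.5 <= |f| < 1.0, so rescale */
    *n -= 1;
}
```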
The example attached returns the smallest floating-point number of the same type as its argument that compares greater than the argument. As finite numbers comprise pretty well all but two of all real(32) numbers, the code that handles finite numbers is prefixed by a likely() conditional to avoid branching in this case. This might help @lucaferranti. It is part of a current project for someone else who wants minimalist code (and maximum performance) for this problem. Here is the link: ATO
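For illustration only (not the attached ATO code), a C sketch of such a "next up" routine for real(32), with likely() marking the path that handles everything other than +inf and NaN:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define likely(x) __builtin_expect(!!(x), 1)

/* smallest float that compares greater than x (IEEE 754 nextUp) */
static float nextup_f32(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);

    if (likely(x < INFINITY))             /* false only for +inf and NaN */
    {
        if (u == 0x80000000u)             /* -0.0: next up is the least positive subnormal */
            u = 0x00000001u;
        else if ((u & 0x80000000u) == 0)  /* positive (or +0.0): step the encoding up      */
            u += 1;
        else                              /* negative (including -inf): step it down       */
            u -= 1;
        memcpy(&x, &u, sizeof x);
    }
    return x;                             /* +inf and NaN map to themselves */
}
```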
Thought bubble reminder - need a likely() and an unlikely() with a param argument. Is it:

```chapel
inline proc likely(param x : bool) param do return x;
inline proc unlikely(param x : bool) param do return x;
```

Hmmm.
Urgency level - None. Consider at leisure.
Looking at the subject back-to-front: by likely()/unlikely(), I mean the sort of functionality one might get within C or C++ by using inline code like the macros sketched below.
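A representative sketch of that kind of inline code, using the GCC/Clang __builtin_expect() builtin:

```c
/* the classic C idiom: the expression still evaluates normally, but the
   compiler is told which outcome to expect and arranges the branches so
   that the expected path falls through */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
```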
I do not mean the C++20 attributes [[likely]] and [[unlikely]], which are aimed at the same general problem but, in some people's opinion, have some issues. Nor am I thinking of these as Chapel code attributes.
By data-profile driven, I mean that in a broad range of applications, the parameters to a routine can be assumed to fit a data profile. I will stick to small routines, because that is where it makes the most difference based on the profile of the technical applications I see day to day.
For example, if these parameters are real(w), one might have the case where they can largely be assumed to be normal in an IEEE 754 sense, may occasionally be subnormal (or denormal), and every now and again might be a zero, an infinity, or even a NaN (Not-a-Number). The processing of these parameters depends on that classification, as sketched below.
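A minimal, hypothetical C sketch of that kind of classification-driven dispatch, assuming likely()/unlikely() macros built on __builtin_expect(); the return codes are placeholders:

```c
#include <math.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* almost every input is normal, so that test is hinted as likely and
   checked first; the rare classes sit behind unlikely() hints */
int classify_by_profile(double x)
{
    if (likely(isnormal(x)))  return 0;   /* the overwhelmingly common case */
    if (unlikely(isnan(x)))   return 3;   /* rare; depends on the algorithm */
    if (unlikely(isinf(x)))   return 2;   /* very rare                      */
    return 1;                             /* zero or subnormal              */
}
```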
Or one may have a routine whose parameters define the behaviour of rocks, which may on rare occasions crack, but will generally bear compressive loads. So the code which handles the uncracked case is the most important one for me to optimize; the rest I might not agonize about. But decisions are made within the code based on the cracked state, the yielding state (as cracking starts to happen), the elastic state, or even the plastic state, etc.
I am not talking about Profile Guided Optimization (PGO) or branch prediction done by the programmer, as those terms are commonly defined in computer science, and definitely not about branch prediction done by the CPU based on what it learns at run time.
This is a placeholder issue for now. Before proceeding, do I need to better define the concept of data-profile guided optimization, or the IEEE 754 terms which will form the bulk of the examples in this discussion? I am mostly concerned with small routines, specifically those with conditional statements where the cost of branching can be a significant part of the overall performance of the routine, or even where the cost of memory transfers is a high proportion of the performance.
Comments please?