
[Feature Request]: data-profile guided optimization using likely()/unlikely() #26933

damianmoz opened this issue Mar 18, 2025 · 7 comments


@damianmoz

Urgency level - None. Consider at leisure.

Looking at the subject back-to-front, by likely()/unlikely(), I mean the sort of functionality one might get within C or C++ by using inline code like

#ifdef LIKELY
#define likely(x)       __builtin_expect((x),1)
#define unlikely(x)     __builtin_expect((x),0)
#else
#define likely(x)           ((x))
#define unlikely(x)         ((x))
#endif

I do not mean the C++20 attributes [[likely]] and [[unlikely]], which are aimed at the same general problem but which, in some people's opinions, have issues. Nor am I even thinking of these as Chapel code attributes.

By data-profile driven, I mean that in a broad range of applications, the parameters to a routine can be assumed to fit a data profile. I will stick to small routines because that is where it makes the most difference, based on the profile of the technical applications I see day to day.

For example, if these parameters are real(w), one might have the case where they can largely be assumed to be normal in an IEEE 754 sense, may even be subnormal (or denormal), and every now and again might be a zero, an infinity, or even a NaN (Not-a-Number). And the processing of these parameters depends on that classification.
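In C, a hedged sketch of how such a data profile might be exploited looks like the following (illustrative only; process() and its per-class actions are hypothetical names, not code from any project):

#include <math.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

/* classify-and-process a datum whose data profile is dominated by normals */
double process(double x)
{
    if (likely(isnormal(x)))
        return x + x;   /* hot path: likely() encourages fall-through layout */
    else if (unlikely(isnan(x)))
        return x;       /* propagate the NaN */
    else if (unlikely(isinf(x)))
        return x;       /* infinities pass through */
    else
        return 0.0;     /* zeros and subnormals, rarely taken */
}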

Or one may have a routine with parameters which define the behaviour of rocks which may, on the rare occasion, crack, but will generally bear compressive loads. So the code which handles the uncracked case is the most important one for me to optimize. The rest I might not agonize about. But decisions are made within the code based on the cracked state, or the yielding state (as cracking starts to happen), or the elastic state, or even the plastic state, etc.

I am not talking about Profile Guided Optimization (PGO) or Branch Prediction done by the programmer as these terms are commonly defined in computer science. Definitely not talking about Branch Prediction done by the CPU based on what it learns at run-time.

This is a placeholder issue for now. Before proceeding, do I need to better define the concept of data-profile guided optimization, or the IEEE 754 terms which will form the bulk of the examples in this discussion? I am mostly concerned with small routines, specifically those with conditional statements where the cost of branching can be a significant part of the overall performance of the routine, or even where the cost of memory transfers is a high proportion of the performance.

Comments please?

@bradcray

Hi Damian — Thanks for capturing this request as an issue. With respect to your questions:

do I need to better define the concept of data-profile guided optimization

Nope, I think we've got a good grasp on that.

the IEEE 754 terms which will form the bulk of the examples in this discussion

That may be useful, depending on the terms. :)

Just looking at your motivating example, it seems as though the main thing that would be required to support it would be an equivalent of __builtin_expect(), which is not a feature I'm familiar with, but is pretty easy to intuit the meaning of, and these docs confirm that intuition: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005fexpect

To support __builtin_expect() the main issues would be:

  • determining which of our various C and LLVM back-end compilers support such a feature (LLVM being the most important, where it looks like it's supported: https://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions) and deciding how to map to it
  • designing the name and interface of the routine (simple to do in an unstable way; may require more discussion to eventually stabilize)

Meanwhile, the choices of enabling vs. disabling seem like they could be done in a few different ways via param-based logic or possibly via imports of modules with renaming?

For a C back-end supporting it, it looks like one can just extern declare __builtin_expect() and make use of it, though whether or not this has the desired effect on the generated code isn't immediately obvious to me: ATO

For LLVM, I think we'll need to do a bit more to plumb it through, though it does look as though we'll be able to.

Seeing the variations that take a probability, it seems like we could also provide an optional argument to enable specifying probabilities on a case-by-case basis as well (where the default could be set by another config param, or perhaps just indicate that __builtin_expect() should be called without passing any probability). I started to sketch this out, but am going to stop here instead to not get ahead of ourselves.
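For reference, the GCC variation that takes a probability is __builtin_expect_with_probability(), available since GCC 9. A minimal illustration, where hot_path() and cold_path() are hypothetical stand-ins:

extern void hot_path(void);   /* hypothetical stand-in */
extern void cold_path(void);  /* hypothetical stand-in */

void step(double x)
{
    /* hint that x > 0.0 holds about 97% of the time */
    if (__builtin_expect_with_probability(x > 0.0, 1, 0.97))
        hot_path();
    else
        cold_path();
}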

@damianmoz

damianmoz commented Mar 19, 2025

Thanks for looking at this Brad. I will put together some examples in C next week so that I can change the "isn't" in your comment "whether or not this has the desired effect on the generated code isn't immediately obvious to me" ... to "is".

For a random problem, let's consider the chance of any 16-bit floating point number appearing as the value of a real(16). The chance that it is a normal number is about 97%, a subnormal about 3%, and an infinity or a zero about 0.006%. The chance of a NaN is arguably about 0.003%, although it can be higher if the programmer is having a bad day, so its statistics relate more to how often the algorithm is dodgy. I could have used the probabilities for real(32) but I cannot do those in my head.

Wrapping some boolean expression in unlikely() tells the compiler that the branching overhead to reach the block of code which that boolean expression guards is OK by me. Encapsulating it in likely() says "please avoid branching".

Looking at a paraphrase of the code in #25088, it says:

if an encoding of a magnitude < the least positive normal then
   do THIS // the datum is a subnormal or a zero
else
   do THAT // the datum is a normal or an infinity or a NaN

97% of the time I want to do THAT. Only 3% of the time will I want to do THIS. So I prefer that there be no branching before it does THAT, i.e.

    // fake - assembler
    if the encoding < the least positive normal then JUMP to L1
    do THAT
    return
L1:
    do THIS
    return

So, I would tell the compiler that the object of the then was unlikely, well, ..., technically less likely. So if this feature were available, I could hint to the compiler with:

if unlikely(an encoding of a magnitude < the least positive normal) then
   do THIS // the datum is a subnormal or a zero
else
   do THAT // the datum is a normal or an infinity or a NaN
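In C, that hint might be realized along these lines (a sketch only: do_this() and do_that() are hypothetical stand-ins, and the bit patterns are those of a 32-bit float):

#include <stdint.h>
#include <string.h>

#define unlikely(x) __builtin_expect((x), 0)

extern float do_this(float);  /* hypothetical: subnormal or zero case */
extern float do_that(float);  /* hypothetical: normal, inf or NaN case */

float classify(float x)
{
    uint32_t u;

    memcpy(&u, &x, sizeof u);                      /* the encoding of x */
    if (unlikely((u & 0x7fffffffu) < 0x00800000u)) /* |x| < least normal */
        return do_this(x);                         /* ~3% of the data */
    return do_that(x);                             /* ~97%: falls through */
}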

And I would expect that I would get something like that fake assembler above. I sometimes look at what likely/unlikely yields and then mess with the order of the if/then/else but WITHOUT likely/unlikely and see if I can replicate it. But mostly I cannot replicate it quite as well.

As a general rule (and I hate that word), you prefix the much less common cases, such as the processing of infinities or NaNs or zeros, with unlikely() and the dominant case with likely(). As the precision of a floating point number increases, i.e. with a real(32) or real(64), the chance of these non-normal cases becomes even less common, so the performance benefit increases. For the real(32) cases, which I use a lot, the benefit is significant. Sometimes I play around with the ordering of the likely() and unlikely() cases and review the assembler to figure out what is optimal, although that can change marginally with new releases of an optimizing compiler, or between one compiler and another.

I generally run my algorithms for every single real(32) in existence and compare that result against the same algorithm using real(64) arithmetic. So I encounter all these cases. In practical algorithms, I mostly see an even higher percentage of these normal cases so the benefits are even more significant.

I will rework that real(16) code with likely/unlikely in C, and do the same for a complex projection, a real scaling algorithm, and an algorithm to yield the content (f, n) of any floating point number x as

x == 2**n * f where 1.0 <= | f | < 2.0
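For comparison, C's frexp() returns a fraction in [0.5, 1.0), so a naive version of that content extraction, valid for finite nonzero x only (zeros, infinities and NaNs being exactly the unlikely() cases), might be sketched as:

#include <math.h>

/* sketch: x == 2**n * f with 1.0 <= |f| < 2.0, for finite nonzero x */
void content(double x, double *f, int *n)
{
    int e;
    double m = frexp(x, &e);  /* x == m * 2**e with 0.5 <= |m| < 1.0 */

    *f = m + m;               /* rescale the fraction into [1.0, 2.0) */
    *n = e - 1;
}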

I will discuss the assembler produced for x86-64 and hopefully make the benefits obvious. I might look at nextafter() or nextup() but they are not commonly used routines. I will stick to simple algorithms where the code and assembler are clearer.

While I am sure that there are cases where making the decision based on a precise probability would be optimal, if you have to go as far as looking at probabilities, I think the original programmer needs to go back and look at the underlying algorithm.
Besides, if you are comparing probabilities like 80% and 20%, you are wasting your time in my opinion, at least for data-profiled stuff. The likely() case should be very dominant.

I only use this technique for routines which yield fewer than 100 or so assembler instructions, although I sometimes play around with stuff in the 100-200 instruction range just to see what happens out of interest. I sometimes get a nice surprise, but generally find the second case unproductive and certainly less predictable.

@damianmoz

damianmoz commented Mar 20, 2025

I will post those examples in full once I figure out how to put an ATO link in these posts.

Here is another example with complex arithmetic

// map z to itself unless z has an infinite component in which
// case it maps to the infinity in the extended complex plane

proc proj(z : complex(?w))
{
    param nan_propagation = true;
    param v = w >> 1;
    type Real = real(v);
    type Uint = uint(v);
    param INF = inf:Real;
    param _N = 1:Uint << (v - 1);
    const re = z.re, ar = abs(re);
    const im = z.im, ai = abs(im);

    inline proc infinity() { return if nan_propagation then ar + ai else INF; }

    // with a bit of luck, the return of z should involve no branch!!

    return if likely(ar != INF && ai != INF) then
        z // > 99.99% of cases are this
    else
        (infinity(), (im.transmute(Uint) & _N).transmute(Real)):complex(w);
}


@damianmoz

Return content of a floating point - dominated by finite and normal numbers. ATO

Scale a real(w) by a power of 2 with minimalist multiplications and no error. ATO

Let's hope I got the ATO links right. Sorry, no unit test case for those yet. But the code highlights the use of likely() and unlikely(). In C, these are 30 and 43 lines of assembler with GCC 11 and are small enough to have minimal inlining overhead. I would hope the same is true of Chapel.

@damianmoz

The example attached returns the smallest floating-point number of the same type as its argument that compares greater than the argument. As finite numbers comprise pretty well all but two of all real(32) numbers, the code to handle finite numbers is prefixed by a likely() conditional to avoid branching in this case.

This might help @lucaferranti. It is part of a current project for someone else who wants minimalist code (and maximum performance) for this problem.

Here is the link ATO
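In C, the underlying idea might be sketched as follows (this is not the linked code, just the standard bit-manipulation trick for a 32-bit float):

#include <stdint.h>
#include <string.h>

#define likely(x) __builtin_expect((x), 1)

/* smallest float comparing greater than x; finite inputs marked likely() */
float nextup32(float x)
{
    uint32_t u;

    memcpy(&u, &x, sizeof u);
    if (likely((u & 0x7f800000u) != 0x7f800000u)) /* finite: almost always */
    {
        if (u == 0x80000000u)   /* -0.0 steps to least positive subnormal */
            u = 0x00000001u;
        else if (u >> 31)       /* negative: step its magnitude down */
            u -= 1u;
        else                    /* positive or +0.0: step its magnitude up */
            u += 1u;
    }
    else if (u == 0xff800000u)  /* -inf steps to the most negative finite */
        u = 0xff7fffffu;
    /* +inf and NaNs are returned unchanged */
    memcpy(&x, &u, sizeof u);
    return x;
}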

@damianmoz

damianmoz commented Mar 24, 2025

Thought bubble reminder - need a likely() and unlikely() with a param argument. Is it:

inline proc likely(param x : bool) param do return x;
inline proc unlikely(param x : bool) param do return x;

Hmmm.
