SymbolExternalizer needs a complete rewrite #29

giulianobelinassi · 2024-05-31T14:01:25Z

The SymbolExternalizer engine needs a complete rewrite so we can do proper optimizations related to clang-extract's execution speed and code size.

With regard to speed:

Use a proper Interval Tree to store which SourceRanges we changed. This should reduce the intersection computation from O(n ²) to O(n log n), where n is the number of changes.
Use a use-def-chain to quickly figure out what are the variables we need to change. This would also allow us to do some optimizations with regard to code size.
Try to use the RecursiveASTVisitor instead of recursively looking into all statements.

With regard to code size:

Try to postpone the location of the rewrote symbol definition so that it is written in the main file, instead of the header it was originally written. This should avoid header expansion by placing the externalized symbol in the main file.

The same features is currently have must be kept. We should also think of ways of implementing #5.

giulianobelinassi · 2024-06-27T02:10:51Z

Commit 625e4af addresses 2 and 3. The issue addressed by 1 was fixed when the IntervalTree was introduced.

Now with regard to the last comment, here is a brain dump of my ideas with regard to how to do it.

Clang-extract is able to rename the symbols we are externalizing, e.g. function => klpe_function. This allows us to do some tricks regarding the position in which we declare this new variable. For example, let's say we want to externalize function here, where g contains the first use of the symbol function:

// [1]
int function(void)
{
}
// a lot of code here
// [2]
int g(void)
{
  function();
}

The resulting code would be:

// [1]
int (*klpe_function)(void);

// [2]
int g(void)
{
  (*klpe_function)();
}

Notice the following here: the declaration of klpe_function could move freely in range between [1] and [2] because the first use of klpe_function is in [2]. But beware that this is not a true statement for all cases: one can construct a case where this is not true by poisoning symbols, etc. I would argue that those cases were created only for counter-argument purposes and would not show up in real-life scenarios.

With this cleared, we can construct even more complex cases by introducing #includes. Let's craft the following example, where the identation level represents how dept it is in the include tree:

#include a.h
  #include b.h
    #include c.h
      int function(); // to be externalized
      
 // Then back in the main file:
 int g()
 {
 function();
 }

In this case, the first use of function is in the main file itself, but its definition is not. Currently, when externalizing it, clang-extract would walk upwards the IncludeTree and mark all headers as to be expanded, which in this case would be a.h, b.h, and c.h. This can lead to a lot of unnecessary code being dumped into the main file.

Now, let's say we moved the declaration of the externalized variable just before its first use in this case. The resulting code would be:

#include "a.h"

int (*klpe_function)(); //externalized
     
// Then back in the main file:
int g()
{
 (*klpe_function)();
}

Here, it is not necessary to expand a.h. The old declaration of function will remain as a consequence of this, but since there are no uses of it, it wont be a problem. As a consequence, this transformation has the potential to reduce the number of lines of a generated livepatch.

Now, a way to implement this would be to first account the SourceLocation of the DeclarationDecl of the symbol we want to externalize, and account the SourceLocation of the first symbol use. Notice, however, that for the use we need to go back to the TopLevel, so we actually need the SourceLocation of the DeclarationDecl that has the use of the symbol, and not the SourceLocation of the symbol itself (i.e., it is not as simply as taking the expr->getBeginLoc() of a DeclRefExpr). On the above example, we need to go back to the int g() SourceLocation.

This problem can be solved efficiently by using the IntervalTree, or not-so-efficiently by doing a linear search on the ASTUnit::top_level_iterator and checking in which decl->getRange() this SourceLocation belongs to. If we manage to do it on a single Toplevel AST sweep, when the second way would be more efficient, though.

Now, if we do not want to do the same thing with the declaration that contains the use of that externalized variable (that means, try to postpone its declaration as well) then we know that the file that contains the function that uses the externalized variable is expanded, else the change would have been lost. So we go there just before the declaration and write the externalized variable there.

One thing that may be of a concern are Macros. Clang-extract often get confused when the declaration comes as a result of a macro, so if the first use of the function comes as a result of a macro expansion, perhaps it would be a good idea to abort this optimization attempt then. You can detect this because the Declaration will have all sort of weirdness in the SouceRange, but the most common one is the result of PrettyPrint::Get_Source_Text returning an empty string.

Since we at this point we are not trying to postpone the declaration that Uses the externalized symbol, we do not need to use the IncludeTree here, and the optimization code should be reduced quite a lot when compared to the general case.

giulianobelinassi · 2024-07-11T20:10:41Z

Commit 74a63c6 implements the last item.

giulianobelinassi added the enhancement New feature or request label May 31, 2024

marcosps self-assigned this Jun 20, 2024

giulianobelinassi closed this as completed Jul 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SymbolExternalizer needs a complete rewrite #29

SymbolExternalizer needs a complete rewrite #29

giulianobelinassi commented May 31, 2024 •

edited

Loading

giulianobelinassi commented Jun 27, 2024 •

edited

Loading

giulianobelinassi commented Jul 11, 2024

SymbolExternalizer needs a complete rewrite #29

SymbolExternalizer needs a complete rewrite #29

Comments

giulianobelinassi commented May 31, 2024 • edited Loading

giulianobelinassi commented Jun 27, 2024 • edited Loading

giulianobelinassi commented Jul 11, 2024

giulianobelinassi commented May 31, 2024 •

edited

Loading

giulianobelinassi commented Jun 27, 2024 •

edited

Loading