
Simple constant propagation AST-based analysis #852

Open

wants to merge 21 commits into base: main
Conversation

jeshecdom
Contributor

Issue

Closes #716.

The solution detects not only division-by-zero problems, but any kind of problem that depends on variable tracing, such as null dereferences and number overflows. However, I still need to add tests for those other cases.

Checklist

  • I have updated CHANGELOG.md
  • I have added tests to demonstrate the contribution is correctly implemented: this usually includes both positive and negative tests, showing the happy path(s) and featuring intentionally broken cases
  • I have run all the tests locally and no test failure was reported
  • I have run the linter, formatter and spellchecker
  • I did not do unrelated and/or undiscussed refactorings

…It only refactors the interpreter so that its semantics can be changed or extended with additional functionality.

At first, I tried a monad-style interpreter, but TypeScript does not have do-notation, and chains of bind functions quickly produce unreadable code.
Although there are suggestions on the internet for emulating do-notation using generators, they look ad hoc, hacky, and unnatural.
So, I decided to take the idea of a monad and adapt it into an imperative flavor.
The idea is to define the semantics of an interpreter as an abstract class parametric over two generic types: "V" (the type of expression results) and
"S" (the type of statement results).
At the moment, module-level declarations do not have an associated generic type, but one will probably be added.

Specific semantics can be represented by implementing the abstract class.
For example, to create an interpreter with the standard semantics, just write:

const interpreter = new Interpreter(new StandardSemantics(ctx));

If we want to change the behavior, for example by collecting special values as statements execute, just pass another semantics:

const interpreter = new Interpreter(new YourCustomSemantics(ctx));

This is a first version of the "monadish" interpreter. As such, it will probably change as more semantics are added, since
they will likely force changes to the interface of the abstract class.
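To make the design concrete, here is a minimal, self-contained sketch of the idea. The class and method names and the toy AST are illustrative, not the actual PR code: the semantics class is parametric over V (the result type of expressions) and S (the result type of statements), while the Interpreter only walks the AST and delegates to the semantics.

```typescript
// Hypothetical sketch of the "monadish" interpreter design.
// A toy AST with numbers, addition, and expression statements:
type AstExpression =
    | { kind: "number"; value: number }
    | { kind: "add"; left: AstExpression; right: AstExpression };
type AstStatement = { kind: "expr"; expr: AstExpression };

// The semantics: decides what expressions (V) and statements (S) evaluate to.
abstract class InterpreterSemantics<V, S> {
    abstract number(value: number): V;
    abstract add(left: V, right: V): V;
    abstract exprStatement(result: V): S;
}

// The interpreter: only traverses the AST, delegating all meaning to the
// semantics object it was constructed with.
class Interpreter<V, S> {
    constructor(private readonly semantics: InterpreterSemantics<V, S>) {}

    interpretExpression(ast: AstExpression): V {
        switch (ast.kind) {
            case "number":
                return this.semantics.number(ast.value);
            case "add":
                return this.semantics.add(
                    this.interpretExpression(ast.left),
                    this.interpretExpression(ast.right),
                );
        }
    }

    interpretStatement(ast: AstStatement): S {
        return this.semantics.exprStatement(this.interpretExpression(ast.expr));
    }
}

// "Standard" semantics: expressions evaluate to numbers, statements to nothing.
class StandardSemantics extends InterpreterSemantics<number, void> {
    number(value: number): number { return value; }
    add(left: number, right: number): number { return left + right; }
    exprStatement(_result: number): void {}
}
```

A custom semantics (for example, one that collects special values as statements execute) would be another subclass with different choices of V and S, without touching the Interpreter itself.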
…hod.

It also supports variable tracing in structs.
… did not show it. This was due to the partial evaluator making use of dummySrcInfo.
- forgot to change initialSctx in foreach statement to foreachSctx
@jeshecdom jeshecdom requested a review from a team as a code owner September 19, 2024 16:06
@anton-trunov anton-trunov self-assigned this Sep 19, 2024
@anton-trunov anton-trunov added this to the v1.6.0 milestone Sep 19, 2024
@jeshecdom
Contributor Author

The main idea of the approach is to keep a map from variable names to either a value or undefined:

Map<string, Value | undefined>

This map is stored in the statement context, and it is used to track the value that variables have so far in the program.

We say that a variable is "undetermined" if either:

  • It is not a key in the map, or
  • It is a key and it maps to undefined.

The approach keeps track of the value of each variable so far. For example, in this program snippet:

let a = 5;   // A
let b = a;   // B
a = 10;      // C

After line A executes, the map will be:
a --> 5
After line B:
a --> 5, b --> 5
After line C:
a --> 10, b --> 5

During this trace analysis, a variable can become undetermined for three main reasons:

  1. The variable gets assigned an expression that cannot be evaluated at compile time.
  2. The variable gets changed by a mutating function.
  3. The variable gets assigned different values at different branches of the program.

I'll explain each case now.

Case 1

Consider this function:

fun test(v: Int) {   // A
   let a = 10;       // B
   a = v - v;        // C
   a = v;            // D
}

After line A, the bindings map is empty. It would be equivalent to instead add the binding:
v --> undefined
I decided not to add bindings while processing function declarations because this way I did not have to touch that part of the codebase. Independently of whether function arguments are added to the map, the procedure maintains the following invariant while tracing a single branch of the code:

The set of keys in the bindings map only grows or remains the same.

This invariant makes merging different branches of the code easier (as will be explained in Case 3).

After line B, the map will be:
a --> 10
After line C:
a --> 0
Note that, contrary to expectation, variable a actually has a value at line C: the analyzer uses partial evaluation, and v - v = 0 independently of the value of v.
After line D:
a --> undefined
because v is undetermined.
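A minimal sketch of how an assignment could update the bindings map, assuming a toy expression type and a hypothetical tryEval partial evaluator (the real analyzer's partial evaluator is much richer), illustrates why line C still yields a value while line D does not:

```typescript
// Hypothetical sketch; names are illustrative, not the actual PR code.
type Value = number; // simplified: only integers
type Bindings = Map<string, Value | undefined>;

type Expr =
    | { kind: "num"; value: number }
    | { kind: "var"; name: string }
    | { kind: "sub"; left: Expr; right: Expr };

// Returns the constant value of `expr`, or undefined if undetermined.
function tryEval(expr: Expr, bindings: Bindings): Value | undefined {
    switch (expr.kind) {
        case "num":
            return expr.value;
        case "var":
            return bindings.get(expr.name); // a missing key is also undefined
        case "sub": {
            // Algebraic simplification: x - x = 0, regardless of x's value.
            if (expr.left.kind === "var" && expr.right.kind === "var" &&
                expr.left.name === expr.right.name) {
                return 0;
            }
            const l = tryEval(expr.left, bindings);
            const r = tryEval(expr.right, bindings);
            return l !== undefined && r !== undefined ? l - r : undefined;
        }
    }
}

// An assignment binds the variable to the partial-evaluation result,
// which may be undefined (i.e., undetermined).
function processAssignment(name: string, rhs: Expr, bindings: Bindings): void {
    bindings.set(name, tryEval(rhs, bindings));
}
```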

Case 2

Consider this program:

extends mutates fun changeMe(self: Int) {
   self = 5;
}

fun test(v: Int) {   // A
   let a = 10;       // B
   a.changeMe()      // C
}

After line A, the map is empty.
After line B:
a --> 10
After line C:
a --> undefined
The reason is that the analyzer treats mutating functions as black boxes. Therefore, after a.changeMe() executes, the analyzer concludes that a could have an arbitrary value.
Treating mutating functions as black boxes is enough to emulate the behavior of FunC, because FunC seems to stop the analysis whenever a variable gets assigned the result of a function call.
Note, however, that if we remove the attribute mutates from the changeMe function, then the analyzer will conclude that after line C the bindings map is still:
a --> 10
because changeMe does NOT mutate a.

Case 3

If control flow branches and then joins, the binding maps of the branches are merged at the join point using the following rule:

Suppose var is a variable that existed in the bindings map before 
the control flow branched. 
If var has the same value v in the binding maps of all branches, then 
var --> v 
is in the bindings map at the join point. Otherwise, 
var --> undefined
is in the bindings map at the join point.

This is best illustrated with an example. Consider this function:

fun test (v: Int) {
   let a = 10;       
   let b = 6;        // A
   if (v >= 5) {
      a = 7;         
      b = 20;        // B
   } else { 
      a = 8;         
      b = 20;        // C
   }
                     // D
}

Control flow at A branches into B and C, which then join at D.
At A, the map is a --> 10, b --> 6, while the maps at B and C are a --> 7, b --> 20 and a --> 8, b --> 20, respectively.

We would compute the map at D as follows. For each variable x in the map at A,
check if x has the same value in the maps at B and C. If it does, add it to the map at D with the common value. If x does not have the same value in B and C, then add x --> undefined at D.

Hence, at D the map will be a --> undefined, b --> 20, because a has different values at B and C, but b has the same value 20.
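The merge rule can be sketched in code as follows. This is a hypothetical shape of mergeBranches (the real function in resolveStatements.ts works on statement contexts): only variables that existed before the branch survive, and a variable keeps its value only if all branches agree on it.

```typescript
// Hypothetical sketch; names are illustrative, not the actual PR code.
type Value = number; // simplified: only integers
type Bindings = Map<string, Value | undefined>;

function mergeBranches(before: Bindings, branches: Bindings[]): Bindings {
    const result: Bindings = new Map();
    // Only variables that existed before the control flow branched survive,
    // so variables declared inside a branch are dropped automatically.
    for (const name of before.keys()) {
        const values = branches.map((b) => b.get(name));
        // Keep the value only if every branch determined the same value.
        const allEqual = values.every((v) => v !== undefined && v === values[0]);
        result.set(name, allEqual ? values[0] : undefined);
    }
    return result;
}
```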

Sometimes the analyzer is able to determine that a particular branch will always be taken. In those cases, instead of merging the binding maps at the join point, the analyzer simply takes the map of the executed branch, using the following rule:

Suppose var is a variable that existed in the bindings map before 
the control flow branched. 
Suppose that branch A will always be taken.
If var has value v in branch A, then 
var --> v 
is in the bindings map at the join point.

For example, in the above program, condition v >= 5 cannot be evaluated at compile time. So, the analyzer will merge branches at D. But suppose instead that the program was:

fun test (v: Int) {
   let a = 10;       
   let b = 6;        // A
   if (a >= 5) {
      a = 7;         
      b = 20;        // B
   } else { 
      a = 8;         
      b = 20;        // C
   }
                     // D
}

Then, the bindings map at D will be a --> 8, b --> 20 because branch C will always be taken, i.e., the condition a >= 5 can be evaluated at compile time.

One last important note. Observe that the rules always start by stating: Suppose var is a variable that existed in the bindings map before the control flow branched. This means that variables declared inside the branches do not survive at the join points. For example, in this program:

fun test (v: Int) {
   let a = 10;       
   let b = 6;        // A
   if (v >= 5) {
      let z = true;
      a = 7;         
      b = 20;        // B
   } else { 
      let x = false;
      a = 8;         
      b = 20;        // C
   }
                     // D
}

variables x and z will not be in the bindings map at D, because x and z were not in the bindings map at A, before the control flow branched.

Handling loops

So far, I haven't talked about how loops are handled by the analyzer. Consider the following function:

fun test (v: Int) {
   let a = 10;       
   let b = 6;        // A
   while (v >= 5) {
      v -= 1;
      a = v;         
      b = 20;        // B
   }
                     // C
}

There are two possible branches at A: the loop executes (which follows branch B) or it does not (which jumps directly to C). Both branches join at C.

If we follow branch B, we must carry out the analysis under the assumption that the loop has already executed an arbitrary number of times. This implies that we cannot assume that the values of a and b are 10 and 6 at line v -= 1, because they are assigned inside the loop and may have changed. In other words, we must start the analysis of branch B with the bindings a --> undefined, b --> undefined, and then analyze each line inside the loop body.

So, in the above example, the map at B will be: v --> undefined, a --> undefined, b --> 20, because a gets assigned v, which is undefined.
The case when the loop does not execute produces the map: a --> 10, b --> 6.
Therefore, merging these two maps at C will produce: a --> undefined, b --> undefined because both a and b have different values on each branch.

If instead, we have the program:

fun test (v: Int) {
   let a = 10;       
   let b = 6;        // A
   while (v >= 5) {
      v -= 1;
      a = v - v + 10;         
      b = 20;        // B
   }
                     // C
}

Then, the map at B will be: v --> undefined, a --> 10, b --> 20. And the map when the loop does not execute: a --> 10, b --> 6. Therefore, merging these maps at C will produce: a --> 10, b --> undefined, because a has the same value in both branches, but b does not.

The above examples suggest a general procedure to handle the branch inside the loop:

Suppose that var is a variable in the bindings map before entering the loop. 
Suppose that var is assigned inside the loop.
Then,
var --> undefined 
should be in the bindings map at the first line inside the loop.

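This loop-preparation rule can be sketched as follows. It is a hypothetical shape of makeAssignedVariablesUndetermined, assuming the set of variables assigned in the loop body has already been collected:

```typescript
// Hypothetical sketch; names are illustrative, not the actual PR code.
type Value = number; // simplified: only integers
type Bindings = Map<string, Value | undefined>;

// Before analyzing a loop body, mark as undetermined every pre-loop variable
// that is assigned anywhere inside the loop. Variables declared inside the
// loop (not present in `before`) are ignored.
function makeAssignedVariablesUndetermined(
    before: Bindings,
    assignedInLoop: Set<string>,
): Bindings {
    const result: Bindings = new Map(before);
    for (const name of assignedInLoop) {
        if (result.has(name)) {
            result.set(name, undefined);
        }
    }
    return result;
}
```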
As was the case with conditionals, sometimes the analyzer is able to determine if a loop will execute or not. In that case, it will take the binding map of the corresponding branch. For example, consider this function:

fun test (v: Int) {
   let a = 10;       
   let b = 6;         // A
   while (a >= 5) {
      a -= 1;
      b = 20;         // B
   }
                      // C
}

In this case, the condition a >= 5 is true at compile time when entering the loop. Hence, the analyzer takes the map of branch B to be the map at C, without doing any merging. In the example, the map at C will be: a --> undefined, b --> 20, which is the same as at B.

Some pointers to the code

  • Probably I need to add the above explanation in the source code itself (maybe after the declaration of the bindings map in StatementContext in resolveStatements.ts).
  • Function mergeBranches in resolveStatements.ts implements the bindings map merging algorithm.
  • Function copyBindings in resolveStatements.ts implements the logic when the analyzer determines that a branch will always be taken, so it simply copies the bindings in the taken branch.
  • Function makeAssignedVariablesUndetermined in resolveStatements.ts prepares the bindings for the analysis of a loop body, i.e., it marks as undefined all pre-loop variables that get assigned inside the loop.
  • I also recommend reading the explanation before function setVariableBinding in resolveStatements.ts, because it explains how structs and contracts are handled by this approach.

@anton-trunov anton-trunov changed the title Issue 716 Simple constant propagation AST-based analysis Sep 24, 2024
Partial commit (does not compile).
…elated to variable tracing to a test folder inside interpreters folder (since constant propagation analyzer is an instance of interpreter).

However, one test case is still failing, but it does not have to do with the analyzer. Will merge with main to see if it solves the issue.
@jeshecdom
Contributor Author

Please DO NOT review this PR yet. It still lacks some code documentation. I will also add some modifications to the handling of loops. I will write a comment in the morning explaining the rationale.

@jeshecdom
Contributor Author

The current code in the PR still implements the analysis of loops as I described it in the looooong comment above. I doubted that loop analysis over ASTs could be done by iterating, because I was trapped in the naive approach that I describe in these new notes below. But it turns out that there is a way to modify the naive approach to actually make it work (as I also describe in these notes).

So, yes, you were correct @anton-trunov. I just needed to carry out the proofs to convince myself =). I will incorporate this way of analyzing loops later, since I will now switch to reviewer mode. The change should not take too long because all the infrastructure is already in the PR.

The following notes are mainly a way for me to remember (and convince myself of) the details of the idea, and to explain the rationale of the analyzer to reviewers.


The semilattice of environments

Denote by $Var$ the set of variables in the program. Denote by $V$ the set of values, which contains a special value $\bot$ that denotes an "undefined value" (I sometimes call it the undetermined value).
For example, when the analyzer starts a function like:

fun test1(arg1: T1, arg2: T2) {
..........
}

It assigns to arg1 and arg2 the undefined value $\bot$.

An environment is a partial function from $Var$ to $V$. Equivalently, an environment is a set of pairs $(a,b)$, where $a$ is a variable and $b$ is a value, which means that variable $a$ maps to value $b$. In programming terms, an environment is simply a map: Map<Var,V>.

For example, in function test1 above, after processing the arguments, the environment is $\{(arg1,\bot),(arg2,\bot)\}$.

The analyzer uses the following join operation on environments (denoted by the symbol $\cup$):

$$e_1 \cup e_2 = \{ (x, e_1(x) = e_2(x)\ ?\ e_1(x) : \bot) \mid x\in dom(e_1) \cap dom(e_2) \}$$

In other words, $e_1 \cup e_2$ takes the variables that are common to both $e_1$ and $e_2$ (dropping the rest of the variables), and for each common variable $x$, it assigns to it $\bot$ if the value of $x$ differs in both environments; otherwise, $x$ gets assigned the common value in the environments.
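In code, the join operation could look like this (a sketch, representing $\bot$ as undefined; the names are illustrative, not the actual PR code):

```typescript
// Join of two environments: keep only the variables common to both domains,
// and keep a value only if both environments agree on it; otherwise ⊥.
type Value = number; // simplified: only integers
type Env = Map<string, Value | undefined>; // undefined represents ⊥

function join(e1: Env, e2: Env): Env {
    const result: Env = new Map();
    for (const [name, v1] of e1) {
        if (!e2.has(name)) continue; // drop variables not common to both
        const v2 = e2.get(name);
        // Agreeing defined values survive; any disagreement (or ⊥) gives ⊥.
        result.set(name, v1 !== undefined && v1 === v2 ? v1 : undefined);
    }
    return result;
}
```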

The rationale for using such join has to do with how branches are joined during analysis. For example, consider this function with a single conditional:

fun test(v: Int) {
   let a: Int = 5;
   let b: Int = 10;
   if (v < 10) {
      let c: Int = 20;
      b = 2;
   } else {
      let d: Int = 50;
      b = 7;
   }
   // Join point
}

Since it is not possible to know at compile time the value of v, it is not possible to know which branch the conditional will take, so we need to execute both branches and join them at the join point at the end of the conditional. After executing the "then" branch, our environment is $\{(a,5),(b,2),(c,20)\}$. After executing the "else" branch, our environment is $\{(a,5),(b,7),(d,50)\}$.
Note that at the join point, variables $c$ and $d$ are no longer in scope. In other words, it is safe to drop $c$ and $d$ at the join point. Variable $a$ did not change in either branch, which means it will have value $5$ at the join point. However, variable $b$ does not have the same value in both branches, which means that its value is undefined at the join point (i.e., $b$ could have either value). So, at the join point, the environment is $\{(a,5),(b,\bot)\}$, which is precisely the result of the join operator:

$$\{(a,5),(b,2),(c,20)\}\cup\{(a,5),(b,7),(d,50)\}=\{(a,5),(b,\bot)\}$$

The $\cup$ operator satisfies the three requirements for a semilattice over the set of environments: it is idempotent, commutative and associative. As such, it induces a join-semilattice where the partial order is defined as:

$$e_1 \leq e_2 \iff e_1 \cup e_2 =e_2$$

For example,

  • $\{(a,5)\} \leq \{(a,\bot)\}$ because: $\{(a,5)\} \cup \{(a,\bot)\} = \{(a,\bot)\}$.
  • $\{(a,5), (b,10)\} \leq \{(a,5)\}$ because: $\{(a,5),(b,10)\} \cup \{(a,5)\} = \{(a,5)\}$.
  • $\{(a,5), (b,10),(c,30)\} \leq \{(a,\bot),(b,10)\}$ because $\{(a,5), (b,10),(c,30)\} \cup \{(a,\bot),(b,10)\} = \{(a,\bot),(b,10)\}$.

In other words, $e_1 \leq e_2$ means that $e_1$ is less "undetermined" than $e_2$, or, equivalently, that $e_1$ is more "defined" than $e_2$.

Also, note that,

  • $e_1 \leq e_1 \cup e_2$, because $e_1 \cup (e_1 \cup e_2) = (e_1 \cup e_1) \cup e_2 = e_1 \cup e_2$, due to associativity and idempotency of $\cup$.
  • $e_2 \leq e_1 \cup e_2$. It has a similar argument.

Handling loops

Naive Approach (Fails, but gives insight into what needs to be done)

Given a loop like:

while(cond) {
.....
}
// join point

Suppose for a moment that the value of condition cond is unknown in our current environment e. Therefore, to compute the environment at the join point, it seems that we need to join all the environments that result from executing the loop 0, 1, 2, 3, 4, ... times. Denote by $b(x)$ the result of executing the loop body once on environment $x$. This suggests that we should compute the join as follows (though we will see that this does not work, even if it looks quite intuitive):

$$e \cup b(e) \cup b(b(e)) \cup b(b(b(e))) \cup \ldots$$

where $e$ denotes the result of executing the loop body 0 times, $e \cup b(e)$ the join of executing the body 0 and 1 times, $e \cup b(e) \cup b(b(e))$ the join of executing the body 0, 1, and 2 times, etc.
Or equivalently,

$$b^0(e) \cup b^1(e) \cup b^2(e) \cup b^3(e) \cup \ldots$$

where $b^i(e)$ denotes the application of $b$ $i$ times starting on $e$, and it is defined recursively as:

  • $b^0(x) = x$
  • $b^{i+1}(x)=b(b^i(x))$

So, if the sequence of computations:

$$b^0(e),\quad b^0(e) \cup b^1(e),\quad b^0(e) \cup b^1(e) \cup b^2(e),\quad b^0(e) \cup b^1(e) \cup b^2(e) \cup b^3(e),\quad \ldots$$

eventually stabilizes, we have our join for the loop.

But notice that in our semilattice, we have $e_1 \leq e_1 \cup e_2$, which means that the above sequence is actually an increasing sequence:

$$b^0(e) \leq b^0(e) \cup b^1(e) \leq b^0(e) \cup b^1(e) \cup b^2(e) \leq b^0(e) \cup b^1(e) \cup b^2(e) \cup b^3(e) \leq \ldots$$

Even though it is possible to prove that this increasing sequence eventually stabilizes (i.e., there is a step in the sequence from which the environment never changes again), the sequence may have several "fake stabilization" periods in which it seems to have stabilized for several steps before abruptly changing into another "fake stabilization" period. Therefore, it is very difficult (if not impossible) to detect whether we are in the final stabilization period or a fake one, which may lead to a premature stop of the iterative process.

The above problem is better illustrated with an example. Consider this function:

fun test(v: Int) {
   let i: Int = 0;
   let a: Int = 10;
   while (v < 10) {
      if (i >= 5) {
         a = 15;
      }
      i += 1;
   }
}

We will have the following environments after the respective iterations:

  • 0 iterations: $\{(v,\bot), (i,0), (a,10)\}$.
  • 1 iteration: $\{(v,\bot), (i,1), (a,10)\}$.
  • 2 iterations: $\{(v,\bot), (i,2), (a,10)\}$.
  • .......
  • 5 iterations: $\{(v,\bot), (i,5), (a,10)\}$.
  • 6 iterations: $\{(v,\bot), (i,6), (a,15)\}$. (Note this is the first time $a$ changed to 15.)
  • The rest of the environments will keep increasing $i$, but will not change $a$ any more.

So, the increasing sequence will be:

  • First step: $\{(v,\bot), (i,0), (a,10)\}$.
  • Second step (joining the 0 and 1 iterations): $\{(v,\bot), (i,\bot), (a,10)\}$.
  • Third step (joining the 0, 1, 2 iterations): $\{(v,\bot), (i,\bot), (a,10)\}$.
  • ....
  • Sixth step (joining the 0...5 iterations): $\{(v,\bot), (i,\bot), (a,10)\}$.
  • Seventh step (joining the 0...6 iterations): $\{(v,\bot), (i,\bot), (a,\bot)\}$.
  • From now on, the sequence will not change anymore.

Note that from the second step to the sixth step, the sequence seems to have stabilized (i.e., it is in a "fake stabilization" period), and it would be unsound to stop the iterative process before the seventh step, because we would incorrectly conclude that $a=10$. However, there is no way to tell whether we are in a fake period or have arrived at the final stabilization just by looking at the environments themselves.

The problem with this naive approach is that the information computed by the joins is never given to the environments computed during the iterations. For example, the data $i=\bot$ in the second step is never given as input to the computation of the 2nd iteration. If we knew that $i=\bot$ before starting the second iteration, the if-statement inside the loop would become undetermined, forcing a branch analysis on variable $a$, and we would realize that $a = \bot$ because in one branch it changes to 15, while in another it remains as 10. This is the idea for the next approach.

Second Approach (Successful)

The idea is that at each step we should take the result of the join of the previous step as input to the next iteration of the loop. More specifically:

  • Environment after we iterate 0 times: $e$.
  • Environment after we iterate once: take the result computed up to 0 iterations, which is $e$, give it as input to the loop body, and join the results: $e \cup b(e)$.
  • Environment after we iterate twice: take the result computed up to 1 iteration, which is $e \cup b(e)$, give it as input to the loop body, and join the results: $e \cup b(e) \cup b(e \cup b(e))$.
  • Environment after we iterate three times: take the result computed up to 2 iterations, which is $e \cup b(e) \cup b(e \cup b(e))$, give it as input to the loop body, and join the results: $e \cup b(e) \cup b(e \cup b(e)) \cup b(e \cup b(e) \cup b(e \cup b(e)))$.
  • And so on.

In other words, we are computing the following sequence of environments:

  • $e$.
  • $e \cup b(e)$.
  • $e \cup b(e) \cup b(e \cup b(e))$.
  • $e \cup b(e) \cup b(e \cup b(e)) \cup b(e \cup b(e) \cup b(e \cup b(e)))$.
  • etc.

The notation can be improved by introducing the following function:

$$B(x) = x \cup b(x)$$

together with the superscript notation $B^i(x)$ to indicate that $B$ is applied $i$ times to $x$.

So, the sequence becomes:

  • $B^0(e) = e$.
  • $B^1(e) = B(e) = e \cup b(e)$.
  • $B^2(e) = B(B^1(e)) = B^1(e) \cup b(B^1(e)) = e \cup b(e) \cup b(e \cup b(e))$.
  • $B^3(e) = B(B^2(e)) = B^2(e) \cup b(B^2(e)) = e \cup b(e) \cup b(e \cup b(e)) \cup b(e \cup b(e) \cup b(e \cup b(e)))$.
  • etc.

Therefore, if this sequence eventually stabilizes, we would get our result. Note that requiring a stabilized sequence is the same as requiring that the sequence eventually reaches a fixpoint for function $B$.

In fact, it is possible to prove that the above sequence is actually an increasing sequence that reaches a fixpoint for $B$. More generally, for any starting environment $x$, we have that:

$$B^0(x) \leq B^1(x) \leq B^2(x) \leq B^3(x) \leq \ldots$$

eventually reaches a fixpoint for $B$ (a.k.a., stabilized sequence). Note that this does not require that $B$ be monotone! The result follows from facts about the particular semilattice we are using and the particular definition of the join operator $\cup$ (the fact that $B$ is not required to be monotone is actually great because I suspect that $B$ is not monotone in this semilattice).

The fact that the above result works for any starting environment is very important, because when starting the analysis of a loop, we are in some arbitrary environment. Moreover, the fact that the sequence reaches an environment $y$ satisfying $B(y) = y$ implies that we can stop the sequence at $y$, because applying one more step of $B$ will give back $y$ again (i.e., the sequence cannot abruptly change several steps after reaching $y$, as occurred in the naive approach).
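The iteration can be sketched as follows. The loop body $b$ is abstracted as a callback that symbolically executes the body once; join (with $\bot$ as undefined) and an environment-equality helper are repeated so the sketch is self-contained. Names are illustrative, not the actual PR code.

```typescript
// Iterate B(x) = x ∪ b(x) starting from e until a fixpoint of B is reached.
type Value = number; // simplified: only integers
type Env = Map<string, Value | undefined>; // undefined represents ⊥

function join(e1: Env, e2: Env): Env {
    const result: Env = new Map();
    for (const [name, v1] of e1) {
        if (!e2.has(name)) continue; // keep only common variables
        const v2 = e2.get(name);
        result.set(name, v1 !== undefined && v1 === v2 ? v1 : undefined);
    }
    return result;
}

function envEquals(e1: Env, e2: Env): boolean {
    if (e1.size !== e2.size) return false;
    for (const [name, v] of e1) {
        if (!e2.has(name) || e2.get(name) !== v) return false;
    }
    return true;
}

// Compute the limit of B^0(e) ≤ B^1(e) ≤ B^2(e) ≤ ... by iterating until
// B(current) = current.
function analyzeLoop(e: Env, b: (env: Env) => Env): Env {
    let current = e;
    for (;;) {
        const next = join(current, b(current));
        if (envEquals(next, current)) return current;
        current = next;
    }
}
```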

I will not write the proofs in this note, but I will illustrate how this approach behaves with the problematic function described in the naive approach.

Here is the function again for convenience:

fun test(v: Int) {
   let i: Int = 0;
   let a: Int = 10;
   while (v < 10) {
      if (i >= 5) {
         a = 15;
      }
      i += 1;
   }
}

We will have the following environments after the respective steps:

  • Step 0: $\{(v,\bot), (i,0), (a,10)\}$.
  • Step 1: $\{(v,\bot), (i,0), (a,10)\} \cup \{(v,\bot), (i,1), (a,10)\} = \{(v,\bot), (i,\bot), (a,10)\}$.
  • Step 2: Before starting the iteration, $i = \bot$ from the previous step. Hence, this forces a branch analysis of the if-statement inside the loop, so that $a = \bot$ because it has different values in different branches. Hence, after the iteration, the environment is $\{(v,\bot), (i,\bot), (a,\bot)\}$. This means that the environment for the second step is: $\{(v,\bot), (i,\bot), (a,10)\} \cup \{(v,\bot), (i,\bot), (a,\bot)\} = \{(v,\bot), (i,\bot), (a,\bot)\}$.
  • Step 3: At this point, applying $B$ to step 2 produces $\{(v,\bot), (i,\bot), (a,\bot)\}$ again. Hence, we have arrived at the fixpoint, which means that the environment at the join point is $\{(v,\bot), (i,\bot), (a,\bot)\}$.

The big question here is: is this analysis too restrictive, in the sense that it will assign $\bot$ to many variables even in cases where it is certain that a particular variable will never be assigned in a specific branch? For example, modify the above program as follows:

fun test(v: Int) {
   let i: Int = 0;
   let a: Int = 10;
   while (v < 10) {
      if (i < 0) {
         a = 15;
      }
      i += 1;
   }
}

Here, a human programmer can see that since $i$ always starts with value 0 and only increases inside the loop, the if-statement inside the loop will never change variable $a$, independently of how many times the loop executes. However, if we repeat the analysis, the fixpoint will assign $a=\bot$ because once $i$ gets assigned $\bot$, the analyzer is forced to do a branch analysis in the if-statement.

In spite of the above, I think that this analysis is enough to cover the FunC errors.

@anton-trunov
Member

@jeshecdom btw, the traditional approach to assigning meanings to the lattice elements is as follows: "bottom" means "unreachable" or some kind of type error (like "not a number" if you are only tracking numbers in your analysis), and you also want the top element, which in this case would mean "any value", so your (semi-)lattice would look like

          T
/   |   |  ... \
0   1   2  ... 
\   |   |  ... /
          ⊥

@jeshecdom
Contributor Author

jeshecdom commented Oct 28, 2024

Thank you!

I will attempt to restate the problem in the more traditional way. So, in the traditional approach, you start by describing a complete lattice on the values first, and then extend it to environments, I see. But this would imply that in my environment map Map<string,ValueWithStatus>, the type ValueWithStatus would need to store the two modes: deleted due to some error ($\bot$) and any possible value ($\top$). This implies that I need to redefine the join operator $\cup$ to handle those two possible modes in the values. I will check later whether this improves the analysis results.

}

public interpretModuleItem(ast: AstModuleItem): void {
export abstract class InterpreterInterface<T> {
Contributor

@verytactical verytactical Nov 6, 2024


OOP in Typescript, and abstract in particular, is heavily unsound. If that wasn't bad enough, it's broken even in Javascript, where o.f() and (o).f() have different meanings, and to add insult to injury, these kinds of problems are not checked by Typescript at all.

Abstract interpreter is best expressed using "tagless final" pattern. Here's a simple example.

Imagine we have a language of numbers and addition. We define

interface Expr<T> {
    num: (value: number) => T;
    add: (left: T, right: T) => T;
}

and can now express 1 + (2 + 3) as

const example = <T>({ num, add }: Expr<T>) => add(num(1), add(num(2), num(3)));

Then we can define concrete interpreters for Expr:

const show: Expr<string> = {
    num: (value) => `${value}`,
    add: (left, right) => `${left} + ${right}`,
};

const run: Expr<number> = {
    num: (value) => value,
    add: (left, right) => left + right,
};

The only weird thing here is that whenever we apply an interpreter to a value, we're actually applying the value to the interpreter: example(show). If that makes anyone uncomfortable, it's easy to add another wrapper:

const showAlg: Expr<string> = { ... };
const show = (value: (algebra: Expr<string>) => string): string => value(showAlg);

From a theory side of things, this is just a regular encoding for existential type through double negation of universal quantification.

Now on to the more obscure parts of this encoding. You could have mentioned that show would display 1 + 2 + 3 instead of 1 + (2 + 3), so let's make a scarecrow issue out of it. Let's add some context that keeps a boolean flag of whether a subexpression should be wrapped in parens.

const showAlg: Expr<(topLevel: boolean) => string> = {
    num: (value) => () => `${value}`,
    add: (left, right) => (topLevel) => {
        const tmp = `${left(false)} + ${right(false)}`;
        return topLevel ? tmp : `(${tmp})`;
    },
};
const show = (value) => value(showAlg)(true);

That kind of Reader monad pattern happens most of the times we intend to use tagless final encoding, and in fact it's very often that we want to add a few other values into our context. In order not to face the tedium of adjusting every single place of our program, it's best to make it an object the first time the code is implemented

type FooContext = { flag: boolean }; // only one... for now
type Foo = (ctx: FooContext) => string;
const fooAlg: Expr<Foo> = { ... };

The best part of the pattern (and in fact the reason it was initially invented) comes from modularity of these interpreters. We have both sequential and parallel composition here:

const showButAll1: Expr<string> = {
   ...show,
   num: () => '1',
};

const withLogging = <T>({ num, add }: Expr<T>): Expr<(log: (msg: string) => void) => T> => ({
    num: (value) => (log) => { log(value.toString()); return num(value); },
    add: (left, right) => (log) => add(left(log), right(log)),
});
const showWithLog = withLogging(show);

const translate = <T>(algebra: Expr1<T>): Expr2<T> => ({ ... });

const biinterpret = <T, U>(algebra1: Expr<T>, algebra2: Expr<U>): Expr<[T, U]> => ({ ... });

and also we can extend (or even reduce) the language without modifying any and all of the previously working code. Together these two solve the extension problem

interface ExtendedExpr<T> extends Expr<T> {
    mul: (left: T, right: T) => T;
}

const showExtended: ExtendedExpr<string> = {
    ...show,
    mul: (left, right) => `${left} * ${right}`,
};

// none of original code was changed
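A self-contained version of this extension step, with the base algebra inlined so it runs on its own (illustrative names, not the PR's code):

```typescript
interface Expr<T> {
    num: (value: number) => T;
    add: (left: T, right: T) => T;
}

// Extension: a new operation, without touching Expr or its interpreters.
interface ExtendedExpr<T> extends Expr<T> {
    mul: (left: T, right: T) => T;
}

const show: Expr<string> = {
    num: (value) => `${value}`,
    add: (left, right) => `${left} + ${right}`,
};

// Old interpreter is reused wholesale; only the new case is written.
const showExtended: ExtendedExpr<string> = {
    ...show,
    mul: (left, right) => `${left} * ${right}`,
};

// Terms using mul demand the extended algebra in their type.
type ExtTerm = <T>(alg: ExtendedExpr<T>) => T;
const term: ExtTerm = (alg) => alg.mul(alg.num(2), alg.add(alg.num(3), alg.num(4)));

console.log(term(showExtended)); // "2 * 3 + 4"
```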

Another place where the pattern excels is forcing the implementer of an interpreter to list all the properties of AST nodes, provided that (by agreement or a linter rule) all new properties are added as first arguments. When another field is introduced to an AST node, it's usually way too easy to forget to handle it somewhere.

I suspect biinterpret(evaluate, generate) is somewhat close to what is going on in this PR (yet I didn't properly read it).

When this pattern does not work that well:

  • If we need any operations with more than 1 argument (for example, equality), it takes a huge effort to encode them, and even then we're guaranteed (proofs omitted) to spend at least O(N^2) operations. I wish Kiselyov had mentioned this in the original paper instead of a passing mention in the related source code on his blog :/
  • If we have to consider more than one node at a time. Usually this signifies we have our AST types wrong, but we already have a codebase with whatever AST is there, and I suspect you're not a fan of the idea of refactoring it. It's definitely possible to use the context to keep information about the nodes surrounding the current one: when we want to do something on a node of type B inside a node of type A, we just add an insideA: boolean flag. It's just a bit tedious, so the general rule is not to use tagless final whenever lots of pattern-matching is expected. I should probably mention there is a Boehm-Berarducci encoding for arbitrary pattern-matching, but it doesn't look very practical to me.
  • The approach is best suited for an "everything is an expression" kind of language. If there are AST types with drastically different models (expressions and statements, for example), you'd have types of the Expr<E> & Stmt<S, E> kind (expressions get interpreted as E, statements as S, and statements also have to know what expressions are modelled as). If the set of concrete interpreters is expected to have lots of different models for different AST nodes, the types can get quite peculiar.
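To make the last point concrete, here is a minimal sketch of the Expr<E> & Stmt<S, E> shape with separate carriers for expressions and statements (illustrative names, not the PR's actual types):

```typescript
interface Expr<E> {
    num: (value: number) => E;
    add: (left: E, right: E) => E;
}

// Statements carry their own model S, but must know the expression model E.
interface Stmt<S, E> {
    exprStmt: (expr: E) => S;
    seq: (first: S, second: S) => S;
}

// A printing interpreter picks E = string and S = string.
const printExpr: Expr<string> = {
    num: (value) => `${value}`,
    add: (left, right) => `(${left} + ${right})`,
};

const printStmt: Stmt<string, string> = {
    exprStmt: (expr) => `${expr};`,
    seq: (first, second) => `${first}\n${second}`,
};

// "1; (2 + 3);" as a two-statement program
const program = printStmt.seq(
    printStmt.exprStmt(printExpr.num(1)),
    printStmt.exprStmt(printExpr.add(printExpr.num(2), printExpr.num(3))),
);
console.log(program);
```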

Contributor

Now that I described in some detail what might be done here, let's come back to the topic of OOP and where it makes sense to use it (I mean, anything that has class or new in it).

The short answer would be "never", but it's not entirely true. With prototypes, JS holds only one object with all the method pointers, while when we don't use a class, we have to create a closure object for every method in every instance of the "class". In fact, it's an exceptionally rare situation when an object holds more than a couple of methods (very likely they should have been standalone functions), or when the extra memory usage and runtime overhead of closures are not worth the extra soundness. It usually happens in code that operates on billions of objects.
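A tiny illustration of the closure-vs-prototype point (names are illustrative):

```typescript
// Closure-based "object": every instance allocates fresh function objects.
const makeCounterClosure = () => {
    let n = 0;
    return { inc: () => ++n, get: () => n };
};

// Class-based: methods live once on the prototype, shared by all instances.
class Counter {
    private n = 0;
    inc() { return ++this.n; }
    get() { return this.n; }
}

const a = makeCounterClosure();
const b = makeCounterClosure();
console.log(a.inc !== b.inc); // true: each closure instance owns its methods

const c = new Counter();
const d = new Counter();
console.log(c.inc === d.inc); // true: prototype methods are shared
```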

Member

The short answer would be "never", but it's not entirely true

Sorry to interrupt your thoughts, but I want to add that a minimal amount of OOP is very handy when dealing with the Object Pool pattern, where you minimize new memory allocations and greatly reduce GC spikes by re-using already created objects. And having a method like .reset() that operates on the fields is super handy for that :)
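A minimal sketch of the Object Pool pattern being described, with a .reset() method (hypothetical names, not from this PR):

```typescript
// An object whose state can be scrubbed for reuse.
class Particle {
    x = 0;
    y = 0;
    alive = false;
    reset(): void {
        this.x = 0;
        this.y = 0;
        this.alive = false;
    }
}

// A pool that hands out recycled objects instead of allocating new ones.
class Pool<T extends { reset(): void }> {
    private free: T[] = [];
    constructor(private create: () => T) {}
    acquire(): T {
        return this.free.pop() ?? this.create();
    }
    release(obj: T): void {
        obj.reset(); // scrub state before the object goes back into the pool
        this.free.push(obj);
    }
}

const pool = new Pool(() => new Particle());
const p = pool.acquire();
p.x = 42;
pool.release(p);
const q = pool.acquire(); // same object, recycled with clean fields
console.log(q === p, q.x); // true 0
```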

Added lattice for values: this simplified join operations.
@jeshecdom
Contributor Author

I still need to explore the refactoring suggested by @verytactical. My main worry right now is that I need to add more tests for the constant propagation analyzer, so I will probably add the tests first and then start exploring the suggested refactoring.

Successfully merging this pull request may close these issues.

FunC codegen error for obvious cases of zero division in statements
4 participants