[Tolk] Source maps implementation #1811
Open
Source maps are a key component of many tools such as debuggers, code coverage, and gas profilers, as they enable mapping the actual executed TVM instructions to the source code in a high-level language.
This PR adds a source map implementation for Tolk and defines a common source map format that can be used for other languages (e.g., FunC). source-map-schema-v1.json describes the schema of the generated source maps.
What do we want to get?
Before we move on to the implementation description, let's understand what we're trying to achieve.
A source map allows us to precisely understand which instructions were generated for a specific line in a high-level language, as well as which high-level language lines correspond to the selected instruction.
This mapping can be represented as follows:
Here, each color on the left and right describes a mapping between high-level code and assembly (instructions).
Therefore, we want to generate a JSON file in the compiler that would allow us to implement this.
To do this, we need to map a part of high-level language code (not necessarily an entire line, as a line can contain multiple expressions) to instructions in the bitcode. Moreover, the instruction must be described not simply as a name, since there may be multiple such instructions within the code, but as a position in the resulting bitcode.
Now let's look at how this can be implemented and what problems need to be solved.
Implementation
Tolk uses Fift to compile assembly into bitcode. Fift itself is written in a very complex way. This imposes limitations on how source maps can be implemented.
Let's start with the basics: the `DEBUGMARK` instruction. The `DEBUGMARK` instruction is a special instruction that isn't included in the TVM, but describes a specific location in high-level language code. In assembly, it might look like this:
Each `DEBUGMARK` starts a new section, where all instructions following it in the current continuation will be associated with that section of code. This way, we know that `GETGLOB 1` was generated from section ID=1, and `ADD` from section ID=3.
Having data for each section, such as the row and column from which the code was generated, we can understand which instructions were generated for each component of the `glob_a + glob_b` expression:
However, in general, this is insufficient, as we need to know not only the instruction the high-level code corresponds to, but also its exact location in the bitcode. This is necessary, for example, for a debugger, since in TVM we step through instructions, so we need to know what high-level language code is currently being executed. In TVM, the bitcode location is described by two components: the hash of the Cell in which the instruction is located and the offset from the start of that Cell.
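As a rough illustration of the section rule described above, the following TypeScript sketch tags every instruction in a listing with the ID of the most recent `DEBUGMARK`. The listing is an assumption reconstructed from the `glob_a + glob_b` discussion, not actual compiler output, and the names `TaggedInstruction` and `assignSections` are hypothetical. The sketch also ignores continuation boundaries, which a real tool would have to respect.

```typescript
// Hypothetical listing: the mnemonics follow the text, the exact layout is assumed.
const listing = `
DEBUGMARK 1
GETGLOB 1
DEBUGMARK 2
GETGLOB 2
DEBUGMARK 3
ADD
`;

interface TaggedInstruction {
  text: string;           // instruction exactly as written in the listing
  section: number | null; // ID of the DEBUGMARK section it falls into, if any
}

// Walks the listing top to bottom; DEBUGMARK lines only update the current
// section and are not returned as instructions.
function assignSections(src: string): TaggedInstruction[] {
  const result: TaggedInstruction[] = [];
  let current: number | null = null;
  for (const raw of src.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    const match = /^DEBUGMARK\s+(\d+)$/.exec(line);
    if (match) {
      current = Number(match[1]);
      continue;
    }
    result.push({ text: line, section: current });
  }
  return result;
}

const tagged = assignSections(listing);
```

Under these assumptions, `tagged` associates `GETGLOB 1` with section 1 and `ADD` with section 3, matching the description above. It says nothing yet about where those instructions end up in the bitcode.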
And here's where the difficulties begin.
The Tolk compiler generates Fift-assembly code, meaning we have no control over how the code is compiled. The most we can do within the Tolk compiler is insert text labels (`DEBUGMARK` instructions) in the appropriate places. But Fift doesn't recognize this instruction, just like TVM.
Fift
The most obvious, yet most difficult, approach is to add `DEBUGMARK` support to Fift, simply skip these instructions during compilation, and write information about the current Cell and its offset to an internal array. This way, after compilation, we'll get clean code, as well as a correspondence between the ID and the instruction position in the bitcode.
As simple and elegant as this solution may sound, there's a catch: implementing it in Fift is extremely difficult due to the language's inherent limitations, and maintaining such code is practically impossible. Furthermore, it's difficult to guarantee that adding support won't break core functionality. We also have plans to implement dedicated debug code via special instructions, and supporting it in Fift will be incredibly complex, while the chosen solution described below will be quite simple.
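Whichever assembler ends up doing it, the idea of skipping `DEBUGMARK` while recording the current position can be sketched in a few lines of TypeScript. This is a toy model under loud assumptions: every instruction is treated as 16 bits wide (real TVM encodings vary), all code lives in one cell, and the cell hash is a placeholder string. The names `Item`, `MarkPosition`, and `assemble` are hypothetical and do not come from Fift or TASM.

```typescript
type Item =
  | { kind: "mark"; id: number }     // a DEBUGMARK with its section ID
  | { kind: "instr"; name: string }; // any real instruction

interface MarkPosition {
  cell: string;   // placeholder for the hash of the containing Cell
  offset: number; // bit offset of the next real instruction in that Cell
}

// Emits only real instructions; each mark is dropped from the output and
// its ID is recorded together with the offset reached so far.
function assemble(items: Item[], cell: string, bitsPerInstr = 16) {
  const emitted: string[] = [];
  const marks = new Map<number, MarkPosition>();
  let offset = 0;
  for (const item of items) {
    if (item.kind === "mark") {
      marks.set(item.id, { cell, offset });
    } else {
      emitted.push(item.name);
      offset += bitsPerInstr; // assumed fixed width, for illustration only
    }
  }
  return { emitted, marks };
}

const out = assemble(
  [
    { kind: "mark", id: 1 },
    { kind: "instr", name: "GETGLOB 1" },
    { kind: "mark", id: 2 },
    { kind: "instr", name: "GETGLOB 2" },
    { kind: "mark", id: 3 },
    { kind: "instr", name: "ADD" },
  ],
  "<cell-hash>",
);
```

In this sketch the emitted code contains no trace of the marks, while `out.marks` holds the ID -> position correspondence. A real implementation would additionally have to handle cell overflow, references, and the actual cell hash.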
Therefore, we decided to implement this via an external assembler, TASM. TASM is an assembler and disassembler implementation for TVM, which has been tested on over 100,000 contracts from a real blockchain. In the future, TASM will be rewritten in C++ and integrated into the Tolk compiler as a replacement for Fift. Currently, it will be used in the `tolk-js` package as a second compilation step in debug mode, but more on that later.
TASM
Why TASM? Simply because it's feasible to maintain: it's implemented in pure TypeScript, any changes are clear and local, and the likelihood of breaking something while implementing `DEBUGMARK` support is minimal.
Let's look at how this will work via TASM. First, Fift still needs to be taught to compile code with `DEBUGMARK`, but for now we can simply embed it as an instruction into the bitcode without processing it in any way. Thus, at the output of Fift, we'll get bitcode that retains the `DEBUGMARK` instructions.
This code can't be executed directly (and `tolk-js` will return it as a separate field and compile it as a separate step, so as not to change the code's behavior during normal compilation), since TVM doesn't recognize the `DEBUGMARK` instruction.
This is where TASM comes into play. TASM already supports `DEBUGMARK`, so such code will be correctly disassembled, and `DEBUGMARK` will be treated as a regular instruction. After decompilation, we compile the code back to bitcode, but now, during compilation, thanks to `DEBUGMARK`, we build the required ID -> instruction position mapping in bitcode.
This solution has some peculiarities that need to be considered:
- It is tied to `tolk-js`, and therefore to the JS ecosystem.
- `DEBUGMARK` takes up space during Fift compilation, meaning there will be more `ref {}`, as less actual code will fit in a single cell, requiring more references and nesting.
The first problem is less significant, as the main development is done through Blueprint and Sandbox, both tools for the JS ecosystem.
The second problem raises more questions: changing the code this way affects the gas (opening a reference => +100 gas). Fortunately, TASM can compile code without references. This means that if, after compiling with `DEBUGMARK`, we have a reference in the main cell containing the code, then upon recompilation, when `DEBUGMARK` is only processed but not embedded in the bitcode, this reference will be removed, since the code will be placed in the root cell. TASM implements the same logic for reference formation as Fift.
Therefore, the code after recompiling with TASM will match the code we would get from Fift if we compiled the Tolk code without `DEBUGMARK`. This makes this solution possible, since we want to debug exactly the same code that will be deployed to production.
Complex code
Tolk has many more constructs than FunC, so a single line of Tolk code can generate multiple assembly instructions. It's not uncommon to see something like this in the final assembly in source map mode:
This means that for a complex expression, we have multiple sections, but they actually represent a single chunk of assembly code. In this case, `EQUAL` will have three debug sections, and the tools using the source map will have to decide how to display this case.
For example, the debugger in this case can perform pseudo-steps, that is, first display the position in the high-level language code that corresponds to `DEBUGMARK 1`, then for 2, 3, and only then move on to the next real instruction. This way, we can achieve exceptionally accurate execution path display even with aggressive optimizations and inlining.
Source map representation
This PR only includes the implementation of source map generation. The generated source map maps each ID to information about the corresponding high-level language code. It does not know anything about the actual instructions or their mapping. This mapping will be implemented in a PR for `tolk-js`, which will use TASM and the source map obtained from the Tolk compiler. See ton-blockchain/tolk-js#13.
The part implemented in this PR includes a description of the location in the high-level language code, the variables available at that location, and additional information about the function where the code is located, including information about the function's inline status. See the schema for a full description.
Example of locations: