@i582 i582 commented Sep 22, 2025

Source maps are a key component of many tools such as debuggers, code coverage, and gas profilers, as they enable mapping the actual executed TVM instructions to the source code in a high-level language.

This PR adds a source map implementation for Tolk and defines a common source map format that can be used for other languages (e.g., FunC). source-map-schema-v1.json describes the schema of the generated source maps.

What do we want to get?

Before we move on to the implementation description, let's understand what we're trying to achieve.

A source map allows us to precisely understand which instructions were generated for a specific line in a high-level language, as well as which high-level language lines correspond to the selected instruction.

This mapping can be represented as follows:

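[Screenshot: high-level code on the left and assembly instructions on the right, with matching colors marking corresponding fragments]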

Here, each color on the left and right describes a mapping between high-level code and assembly (instructions).

Therefore, we want to generate a JSON file in the compiler that would allow us to implement this.

To do this, we need to map a part of high-level language code (not necessarily an entire line, as a line can contain multiple expressions) to instructions in the bitcode. Moreover, the instruction must be described not simply as a name, since there may be multiple such instructions within the code, but as a position in the resulting bitcode.

Now let's look at how this can be implemented and what problems need to be solved.

Implementation

Tolk uses Fift to compile assembly into bitcode. Fift itself is written in a very complex way. This imposes limitations on how source maps can be implemented.

Let's start with the basics: the DEBUGMARK instruction. DEBUGMARK is a special instruction that is not part of TVM; it only marks a specific location in the high-level language code. In assembly, it might look like this:

DEBUGMARK 1
GETGLOB 1
DEBUGMARK 2
GETGLOB 2
DEBUGMARK 3
ADD

Each DEBUGMARK starts a new section, where all instructions following it in the current continuation will be associated with that section of code. This way, we know that GETGLOB 1 was generated from section ID=1, and ADD from section ID=3.
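The section rule above can be sketched in a few lines of TypeScript. This is only an illustration: `tagInstructions` and `TaggedInstruction` are hypothetical names, not part of Tolk, Fift, or TASM, and the sketch treats the input as a single flat continuation.

```typescript
// Hypothetical helper for illustration; not part of Tolk, Fift, or TASM.
interface TaggedInstruction {
  instruction: string;      // e.g. "GETGLOB 1"
  sectionId: number | null; // ID of the closest preceding DEBUGMARK, if any
}

// Walks a flat list of assembly lines and tags every real instruction
// with the section opened by the most recent DEBUGMARK.
function tagInstructions(asmLines: string[]): TaggedInstruction[] {
  const result: TaggedInstruction[] = [];
  let current: number | null = null;
  for (const raw of asmLines) {
    const line = raw.trim();
    if (line === "") continue;
    const parts = line.split(/\s+/);
    if (parts[0] === "DEBUGMARK") {
      // A mark produces no output; it only switches the current section.
      current = parseInt(parts[1] ?? "", 10);
      continue;
    }
    result.push({ instruction: line, sectionId: current });
  }
  return result;
}

const tagged = tagInstructions([
  "DEBUGMARK 1",
  "GETGLOB 1",
  "DEBUGMARK 2",
  "GETGLOB 2",
  "DEBUGMARK 3",
  "ADD",
]);
```

Here `tagged` pairs GETGLOB 1 with section 1, GETGLOB 2 with section 2, and ADD with section 3. Nested continuations would need a separate "current section" per continuation, which this sketch leaves out.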

Having data for each section, such as the row and column from which the code was generated, we can understand which instructions were generated for each component of the glob_a + glob_b expression:

glob_a + glob_b
^      ^ ^ 
|      | GETGLOB 2
|      |
|      ADD
|
GETGLOB 1

However, in general, this is insufficient, as we need to know not only the instruction the high-level code corresponds to, but also its exact location in the bitcode. This is necessary, for example, for a debugger, since in TVM we step through instructions, so we need to know what high-level language code is currently being executed. In TVM, the bitcode location is described by two components: the hash of the Cell in which the instruction is located and the offset from the start of that Cell.

And here's where the difficulties begin.

The Tolk compiler generates Fift assembly code, meaning we have no control over how the code is compiled. The most we can do within the Tolk compiler is insert text labels (DEBUGMARK instructions) in the appropriate places. But neither Fift nor TVM recognizes this instruction.

Fift

The most obvious, yet most difficult, approach is to add DEBUGMARK support to Fift, simply skip these instructions during compilation, and write information about the current Cell and its offset to an internal array. This way, after compilation, we'll get clean code, as well as a correspondence between the ID and the instruction position in the bitcode.

As simple and elegant as this solution may sound, there's a catch: implementing it in Fift is extremely difficult due to the language's inherent limitations, and maintaining such code is practically impossible. Furthermore, it's difficult to guarantee that adding support won't break core functionality. We also have plans to implement dedicated debug code via special instructions, and supporting it in Fift will be incredibly complex, while the chosen solution described below will be quite simple.

Therefore, we decided to implement this via an external assembler TASM. TASM is an assembler and disassembler implementation for TVM, which has been tested on over 100,000 contracts from a real blockchain. In the future, TASM will be rewritten in C++ and integrated into the Tolk compiler as a replacement for Fift. Currently, it will be used in the tolk-js package as a second compilation step in debug mode, but more on that later.

TASM

Why TASM? Simply because it's feasible to maintain: it's implemented in pure TypeScript, any changes are clear and local, and the likelihood of breaking something while implementing DEBUGMARK support is minimal.

Let's look at how this will work via TASM. First, Fift still needs to be taught to compile code with DEBUGMARK, but for now we can simply embed it as an instruction into the bitcode without processing it in any way. Thus, at the output of Fift, we'll get bitcode that retains the DEBUGMARK instructions.

This code can't be executed directly, since TVM doesn't recognize the DEBUGMARK instruction. To avoid changing the code's behavior during normal compilation, tolk-js will compile it as a separate step and return it as a separate field.

This is where TASM comes into play. TASM already supports DEBUGMARK, so such code will be correctly disassembled, and DEBUGMARK will be treated as a regular instruction. After disassembly, we compile the code back to bitcode, but this time DEBUGMARK is not emitted: instead, each time the assembler encounters one, it records the current position, which gives us the required mapping from ID to instruction position in the bitcode.
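The recompilation step can be pictured with the following minimal sketch. It is not the TASM API: `AsmItem`, `MarkOffset`, and `assembleSingleCell` are hypothetical names, the instruction sizes are supplied by the caller and are illustrative, and everything is assumed to fit into one cell with no references.

```typescript
// Hypothetical types for illustration; this is not the TASM API.
type AsmItem =
  | { kind: "mark"; id: number }                   // a DEBUGMARK with its ID
  | { kind: "instr"; name: string; bits: number }; // bits: encoded size, given by the caller

interface MarkOffset {
  id: number;     // DEBUGMARK ID
  offset: number; // bit offset from the start of the cell
}

interface AssembleResult {
  emitted: string[];   // instructions that actually end up in the bitcode
  marks: MarkOffset[]; // position recorded for each DEBUGMARK
  totalBits: number;
}

// Assembles a flat instruction list into one imaginary cell.
// DEBUGMARK items are never emitted; they only record the current offset.
function assembleSingleCell(items: AsmItem[]): AssembleResult {
  const emitted: string[] = [];
  const marks: MarkOffset[] = [];
  let offset = 0;
  for (const item of items) {
    if (item.kind === "mark") {
      marks.push({ id: item.id, offset });
      continue;
    }
    emitted.push(item.name);
    offset += item.bits;
  }
  return { emitted, marks, totalBits: offset };
}

// Sizes below are illustrative inputs for the sketch.
const assembled = assembleSingleCell([
  { kind: "mark", id: 1 },
  { kind: "instr", name: "GETGLOB 1", bits: 16 },
  { kind: "mark", id: 2 },
  { kind: "instr", name: "GETGLOB 2", bits: 16 },
  { kind: "mark", id: 3 },
  { kind: "instr", name: "ADD", bits: 8 },
]);
```

A real assembler would additionally have to split code across cells and references, and the cell hash part of each position only becomes known once the containing cell is finalized, so it would be attached to the recorded offsets afterwards.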

This solution has some peculiarities that need to be considered:

  • Until TASM is implemented in the compiler, source map generation will only be possible via tolk-js.
  • DEBUGMARK takes up space during Fift compilation, meaning there will be more ref {}, as less actual code will fit in a single cell, requiring more references and nesting.

The first problem is less significant, as the main development is done through Blueprint and Sandbox, both tools for the JS ecosystem.

The second problem raises more questions: changing the code this way affects gas usage (opening a reference => +100 gas). Fortunately, TASM can recompile the code without the extra references. This means that if, after compiling with DEBUGMARK, we have a reference in the main cell containing the code, then upon recompilation, when DEBUGMARK is only processed but not embedded in the bitcode, this reference will be removed, since the code will fit in the root cell. TASM implements the same logic for reference formation as Fift.

Therefore, the code after recompiling with TASM will match the code we would get from Fift if we compiled the Tolk code without DEBUGMARK. This makes this solution possible, since we want to debug exactly the same code that will be deployed to production.

Complex code

Tolk has many more constructs than FunC, so several nested pieces of Tolk code can end up mapped to the same assembly instruction. It's not uncommon to see something like this in the final assembly in source map mode:

DEBUGMARK 1
DEBUGMARK 2
DEBUGMARK 3
EQUAL
...

This means that for a complex expression, we have multiple sections, but they actually represent a single chunk of assembly code. In this case, EQUAL will have three debug sections, and the tools using the source map will have to decide how to display this case.

For example, the debugger in this case can perform pseudo-steps, that is, first display the position in the high-level language code that corresponds to DEBUGMARK 1, then for 2, 3, and only then move on to the next real instruction. This way, we can achieve exceptionally accurate execution path display even with aggressive optimizations and inlining.
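One possible way to model such pseudo-steps is sketched below. The names `StepInput`, `DebugStep`, and `planSteps` are hypothetical and do not come from any existing debugger; the sketch only shows how consecutive marks in front of one instruction can be expanded into several stops.

```typescript
// Hypothetical types for illustration; not taken from an existing debugger.
type StepInput =
  | { kind: "mark"; id: number }
  | { kind: "instr"; name: string };

interface DebugStep {
  sectionId: number | null; // which source map entry to highlight
  instruction: string;      // the real instruction the debugger is stopped at
  executes: boolean;        // false for a pseudo-step, true when the instruction runs
}

// Expands runs of DEBUGMARKs into pseudo-steps: one stop per mark,
// where only the last stop actually executes the instruction.
function planSteps(items: StepInput[]): DebugStep[] {
  const steps: DebugStep[] = [];
  let pending: number[] = [];
  let last: number | null = null;
  for (const item of items) {
    if (item.kind === "mark") {
      pending.push(item.id);
      continue;
    }
    if (pending.length === 0) {
      // No new mark: the instruction stays in the previous section.
      steps.push({ sectionId: last, instruction: item.name, executes: true });
      continue;
    }
    let seen = 0;
    for (const id of pending) {
      seen++;
      steps.push({
        sectionId: id,
        instruction: item.name,
        executes: seen === pending.length,
      });
      last = id;
    }
    pending = [];
  }
  return steps;
}

const steps = planSteps([
  { kind: "mark", id: 1 },
  { kind: "mark", id: 2 },
  { kind: "mark", id: 3 },
  { kind: "instr", name: "EQUAL" },
  { kind: "instr", name: "NOT" },
]);
```

For this input the plan contains two pseudo-steps on EQUAL (sections 1 and 2), one executing step on EQUAL (section 3), and one executing step on NOT, which inherits section 3. Other tools, such as coverage, may prefer to attribute the instruction to all three sections at once instead.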

Source map representation

This PR only includes the implementation of source map generation, which maps each ID to information about the corresponding high-level language code. This source map does not know anything about the actual instructions or their positions. That mapping will be implemented in a PR for tolk-js, which will use TASM and the source map obtained from the Tolk compiler. See ton-blockchain/tolk-js#13.

The part implemented in this PR includes a description of the location in the high-level language code, the variables available at that location, and additional information about the function where the code is located, including information about the function's inline status. See the schema for a full description.

Example of locations:

{
  "idx": 2,
  "loc": {
    "file": "/Users/petrmakhnev/ton-for-tolk/crypto/smartcont/tolk-stdlib/common.tolk",
    "line": 928,
    "col": 4,
    "line_offset": 1,
    "length": 1
  },
  "vars": [],
  "context": {
    "ast_kind": "ast_return_statement",
    "func_name": "MapLookupResult<slice>.loadValue",
    "func_inline_mode": 0
  }
},
{
  "idx": 3,
  "loc": {
    "file": "/Users/petrmakhnev/ton-for-tolk/test.tolk",
    "line": 19,
    "col": 4,
    "line_offset": 0,
    "length": 1
  },
  "vars": [
    {
      "name": "a",
      "type": "int"
    },
    {
      "name": "b",
      "type": "int"
    }
  ],
  "context": {
    "is_entry": true,
    "ast_kind": "ast_function_declaration",
    "func_name": "some_func",
    "func_inline_mode": 4
  }
},
{
  "idx": 4,
  "loc": {
    "file": "/Users/petrmakhnev/ton-for-tolk/test.tolk",
    "line": 20,
    "col": 13,
    "line_offset": 0,
    "length": 1
  },
  "vars": [
    {
      "name": "a",
      "type": "int"
    },
    {
      "name": "b",
      "type": "int"
    }
  ],
  "context": {
    "ast_kind": "ast_function_call",
    "func_name": "some_func",
    "func_inline_mode": 4
  }
},
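For tooling written in TypeScript, entries like the ones above could be typed roughly as follows. The field names and types are inferred only from these three sample entries, so treat them as an assumption; source-map-schema-v1.json remains the authoritative description. `describeMark` is a hypothetical helper, and the shortened file name in the sample is made up for the example.

```typescript
// Types inferred from the sample entries above; the JSON schema is authoritative.
interface SourceLocation {
  file: string;
  line: number;
  col: number;
  line_offset: number;
  length: number;
}

interface VariableInfo {
  name: string;
  type: string;
}

interface LocationContext {
  is_entry?: boolean; // present only on some entries in the sample
  ast_kind: string;
  func_name: string;
  func_inline_mode: number;
}

interface LocationEntry {
  idx: number; // matches the DEBUGMARK ID
  loc: SourceLocation;
  vars: VariableInfo[];
  context: LocationContext;
}

// Hypothetical helper: resolves a DEBUGMARK ID to a short human-readable label.
function describeMark(entries: LocationEntry[], id: number): string | null {
  for (const entry of entries) {
    if (entry.idx === id) {
      return `${entry.loc.file}:${entry.loc.line}:${entry.loc.col} in ${entry.context.func_name}`;
    }
  }
  return null;
}

const sampleEntries: LocationEntry[] = [
  {
    idx: 4,
    loc: { file: "test.tolk", line: 20, col: 13, line_offset: 0, length: 1 },
    vars: [
      { name: "a", type: "int" },
      { name: "b", type: "int" },
    ],
    context: { ast_kind: "ast_function_call", func_name: "some_func", func_inline_mode: 4 },
  },
];

const label = describeMark(sampleEntries, 4);
```

A debugger could combine such a lookup with the ID-to-position mapping produced on the tolk-js side to show the source location and the visible variables for the instruction it is stopped at.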

- Store global variables locations
- Store events
- Store "is_temporary" flag for variables
- Store condition of assert
- Overall better naming