Future-proof "Scopes" encoding #158

szuend · 2024-11-21T11:19:42Z

@jridgewell and I brainstormed a bit how we could make the "Scopes" encoding forward compatible without making it too complex or heavy handed. I'll implement some of these in the coming days to get some numbers to compare.

Future-proof "Scopes" encoding

The current "Scopes" encoding is not ideal w.r.t. to future extension:

Adding new fields to OriginalScope and GeneratedRange in a backwards compatible way is impossible. Any tool implementing the current proposal would break once we add new optional fields to either data structure.
The encoding uses the , and ; characters on top of base64 encoded VLQ numbers. Moving to a future binary source map format will require a different encoding for "Scopes" to account for , and ;.

We should aim for an encoding that is both forwards-compatible and is purely VLQ based: So the only difference between the current JSON source map format and a potential future binary format is how VLQs are encoded.

The crux of the issue is to find the right balance between

retaining some flexibility for future extensions without going overboard (e.g DWARF-style encoding),
encoding/decoding complexity,
and encoded size.

This document proposes some potential "Scopes" encodings that keep both goals in mind while aiming for a healthy balance.

Grammar

The encoding formats are presented in a EBNF-like grammar with:

there is only one terminal: a VLQ. Each terminal is labelled and we denote them with uppercase (e.g. TERMINAL is a VLQ with the label 'TERMINAL').
non-terminals denoted with snake case (e.g. non_term).
symbol* means zero or more repetitions of symbol.
symbol? means zero or one symbol.
symbol[N] means N occurrences of symbol.

Option A - Prefix items with their length

original_scopes = (LENGTH original_item)*

original_item = original_start_item | original_end_item

original_start_item =
    LINE
    COLUMN
    FLAGS
    NAME? // present if FLAGS<0> is set
    KIND? // present if FLAGS<1> is set
    VARIABLE_COUNT
    VARIABLE[VARIABLE_COUNT]

original_end_item =
    LINE
    COLUMN

generated_ranges = (LENGTH generated_item)*

generated_item = generated_start_item | generated_end_item

generated_start_item =
    COLUMN   // the actual value is COLUMN<1:n>.
    LINE?    // if COLUMN<0> is set.
    FLAGS
    DEFINITION_SOURCE_OFFSET?  // present if FLAGS<0> is set
    DEFINITION_ITEM_OFFSET?    // present if FLAGS<0> is set
    CALL_SITE_SOURCE?          // present if FLAGS<1> is set
    CALL_SITE_LINE?            // present if FLAGS<1> is set
    CALL_SITE_COLUMN?          // present if FLAGS<1> is set
    BINDING_COUNT
    binding[BINDING_COUNT]

binding =
    EXPR_OR_SUB_RANGE_LENGTH   // -1 = not available, >=0 offset into "names"
    EXPR_0?                    // present if EXPR_OR_SUBRANGE_LENGTH < -1.
    sub_range_binding[-EXPR_OR_SUBRANGE_LENGTH - 1]

sub_range_binding =
    LINE
    COLUMN
    EXPR

generated_end_item =
    COLUMN   // the actual value is COLUMN<1:n>.
    LINE?    // if COLUMN<0> is set.

This is identical to the current proposal modulo:

Each item is prefixed with the number of VLQs in the item
Variables in OriginalScope and bindings in GeneratedRange are prefixed with their length
columns in the generated range encode whether a line VLQ is present or not

original_start_item and original_end_item are distinguished by their length: A "end" item always has 2 VLQs while a "start" item has at least 3.
generated_start_item and generated_end_item are distinguished by their length: A "end" item has 1 or 2 VLQs while a "start" item has at least 3.

Option B - Add "remaining" count in the presence of unknown flags

Similar to Option A, so we'll list only the changed productions:

original_scopes = original_item*

original_start_item =
    LINE
    COLUMN
    FLAGS
    NAME? // present if FLAGS<0> is set
    KIND? // present if FLAGS<1> is set
    VARIABLE_COUNT
    VARIABLE[VARIABLE_COUNT]
    REMAINING?  // present if FLAGS<n:3> is not zero.
    REST[REMAINING]

generated_ranges = generated_item*

generated_start_item =
    COLUMN   // the actual value is COLUMN<1:n>.
    LINE?    // if COLUMN<0> is set.
    FLAGS
    DEFINITION_SOURCE_OFFSET?  // present if FLAGS<0> is set
    DEFINITION_ITEM_OFFSET?    // present if FLAGS<0> is set
    CALL_SITE_SOURCE?          // present if FLAGS<1> is set
    CALL_SITE_LINE?            // present if FLAGS<1> is set
    CALL_SITE_COLUMN?          // present if FLAGS<1> is set
    BINDING_COUNT
    binding[BINDING_COUNT]
    REMAINING?  // present if FLAGS<n:4> is not zero.
    REST[REMAINING]

Advantages over Option A:

We only pay the price of encoding the item length once we actually add new fields
Variables/bindings are not included, so REMAINING stays small even for scopes/ranges with lots of variables

Quirks:

Adding new marker flags to FLAGS (not new fields) requires generators to emit a REMAINING value of 0.

Option C - Tag-Length-Value

Similar to Option A but we prefix each item not only with it's length but a tag as well. The advantages are:

We can encode scopes and ranges in one blob. That is the JSON could have a single "scopes" field containing the combination of "originalScopes" and "generatedRanges".
Start/end items can be distinguished by their tag.
We keep the door open for not only extending original_start_item and generated_start_item, but adding new item types all-together.

Since it's similar to option A, we'll list only the changed productions:

scopes = items*

item =
      "0x1" LENGTH original_start_item
    | "0x2" LENGTH original_end_item
    | "0x3" LENGTH generated_start_item
    | "0x4" LENGTH generated_end_item

Option C2 - DWARF-style zero entries

This is a variant to Option C. Instead of using original_start_item and original_end_item, we combine both into a original_item. Similar to DWARF, nesting is achieved by using a special tag to denote the end of a item's children.

item =
      "0x0"
    | "0x1" LENGTH original_item
    | "0x2" LENGTH generated_item

original_item =
    START_LINE
    START_COLUMN
    END_LINE
    END_COLUMN
    FLAGS
    // ...

generated_item =
    START_COLUMN
    START_LINE?
    END_COLUMN
    END_LINE?
    FLAGS
    // ....

Example of nested scopes (tags only): [0x1, ...<length + content>, 0x1, ...<length + content>, 0x0, 0x0].

This comes with some special rules if we don't want to lose the efficiency of relative line/column numbers for start and end locations:

A scopes or ranges' start location is relative to the preceding siblings' end location, or the parents' start location if it's the first child.
A scopes or ranges' end location is relative to it's last child's end location, or it's start location if it does not have any children.

There is also the question of START_LINE, and END_LINE in generated_item. We could encode it's presence in FLAGS or use the LSB of the respective *_COLUMN.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Future-proof "Scopes" encoding #158

Future-proof "Scopes" encoding #158

szuend commented Nov 21, 2024 •

edited

Loading

Future-proof "Scopes" encoding #158

Future-proof "Scopes" encoding #158

Comments

szuend commented Nov 21, 2024 • edited Loading

Future-proof "Scopes" encoding

Grammar

Option A - Prefix items with their length

Option B - Add "remaining" count in the presence of unknown flags

Option C - Tag-Length-Value

Option C2 - DWARF-style zero entries

szuend commented Nov 21, 2024 •

edited

Loading