Skip to content

Tree-sitter rolling fixes, 1.128 edition#1251

Merged
savetheclocktower merged 10 commits into
masterfrom
tree-sitter-1-128
Apr 30, 2025
Merged

Tree-sitter rolling fixes, 1.128 edition#1251
savetheclocktower merged 10 commits into
masterfrom
tree-sitter-1-128

Conversation

@savetheclocktower
Copy link
Copy Markdown
Contributor

Had a couple already in the hopper just from the last couple of days.

JSON plus comments! Big deal! Should've done it ages ago!

…plus an extra “JSON With Comments” grammar that explicitly allows them and is used for `.jsonc` files.

Comments are not allowed in the JSON spec, so not all tooling supports parsing JSON wiht comments. But that limits their usefulness as configuration files. VS Code has popularized JSONC (a flavor of JSON where both line and block comments are expressly permitted) as a config file format.

Currently, we do _nothing_ around comments in JSON files. They are highlighted like plain text. But `tree-sitter-json` tolerates and parses comments!

Because of this, we can support JSONC for free. Hence a new grammar that uses the exact same parser and query files as the current JSON grammar, but where comments are legal.

We can also handle them better in ordinary JSON files: by marking them with the `invalid.illegal` scope, much like we already do with trailing commas.

Some tooling even blurs lines a bit by adding comments to `.json` files. (`tsconfig.json`, from a _different_ Microsoft team, is perhaps the most notable offender.) So we’re also offering a way for users to opt into a mode that tolerates comments. A new “Allow Comments in .json files” setting on the package will stop marking comments as illegal in JSON.

I think this strikes a good balance between being correct and being realistic to the real world.
…when Tree-sitter pretends to see an “optional chain” node in the tree.

Tree-sitter has to make the best of invalid syntax all the time because it reparses even while people are typing. When there are invalidities, it tries a number of strategies to get back to a normal parsing mode as quickly as possible.

The strategies fall into two buckets: (a) pretending that an invalid token isn't there, and (b) pretending something that _ought_ to be there actually _is_. If the latter strategy gets it to a valid tree faster, that's what it'll do.

`tree-sitter-javascript` parses `optional_chain` nodes (the “?.” in `foo?.bar`) in more places than they’re actually valid. They’re not allowed on the left-hand side of an assignment, so when we see one in that context, we add an `invalid.illegal` scope.

But sometimes the `optional_chain` node we see is not actually there because of Tree-sitter’s error-recovery behavior. So we add an `invalid.illegal` scope at a given position, expecting that we'll close it later… but we never do, since the closing scope boundary is at the same spot, and we close scopes before we open them.

There's a larger bug fix that needs to happen to address all _that_, but the simplest fix for now is to enforce the content of the `optional_chain` node so that we don’t try to add a scope to an empty node.
This is the “proper” fix for the issue I described in an earlier commit in which a zero-width capture could “leak” a scope name into everything that comes after it.

Those captures may still be valid for other uses. For instance, indentation resolution uses `ScopeResolver#store` to apply the logic of custom predicates, but does not need to consume the boundary data; it is happy to tolerate empty captures if the grammar author opts into it. That should be unchanged; we just won't add any boundary data that we know will produce invalidity.

This will do the right thing with regard to the syntax highlighting _and_ the “Editor: Log Cursor Scope” command.
Long explanation follows. (Long even by my standards.)

### Overview

When a new grammar is activated — as happens often during window startup — we do something extremely disruptive to the buffer: we take all language layers and make them re-evaluate all their injections across their entire ranges.

Ordinarily, this happens on every keystroke, but the scope of the re-evaluation is limited to the range invalidated by the most recent edit. (In Markdown, if I type a letter inside of a paragraph, the only injection re-evaluated will be that particular paragraph; we can rule out that any other injections are implicated by the change.)

But when a new grammar is added, we need to reconsider any injections that were previously attempted but failed. (If we try to inject the `hyperlink` grammar into Markdown paragraphs, but we’re still starting up and said grammar isn’t active yet, the attempt will fail. But when the `hyperlink` grammar, or any other grammar, is activated, we’ll try again.)

This late activation of a grammar (or sometimes change of an existing grammar) is unpleasant because it makes us throw out some work and try again. One of the main upsides of Tree-sitter is that only the _initial_ parse has to handle the entire buffer; but when we have to redo injection work, it forces a reparse of all layers except the root language layer.

### Definitional crash course

* All Tree-sitter buffers consist of a root `LanguageLayer` that uses the buffer’s base grammar to highlight the entire file.
* If it needs to inject other types of content (HTML needs `script` blocks to contain JS, JS needs `RegExp` literals to contain regex literal highlighting), it creates new child `LanguageLayer`s for specific ranges.
* The language that’s injected can itself inject into sub-ranges of its own layer. Hence an HTML file can inject a JS grammar, which in turn can inject a regex grammar.
* To add an injection to (e.g.) JavaScript, I’d call `atom.grammars.addInjectionPoint('source.js', foo)`, where `foo` is an object that describes
  * which named Tree-sitter nodes to inject into,
  * which language to inject into each one (identified by strings that are matched against grammars in fuzzy pattern), and
  * other options that are unimportant right now.
* Finally, a quick refresher on markers: they track logical ranges of a buffer in a way that tolerates shifts in any direction (because of edits above or below) and even enlargment and shrinking (because of edits within). We use them for many things, but in this case we use them to track logical ranges of injections.
  * A `LanguageLayer` can inject into any number of discrete ranges with gaps in between, but each `LanguageLayer` has a marker that tracks the union of all those ranges and caps via a marker.

### The problem

When a grammar is activated, it can have a catastrophic effect on performance. Re-evaluating each layer in an existing buffer can take a long time; and if the grammar activation order is especially unlucky, the cost must be paid over and over again, as @DeeDeeG found out.

### Stress-testing

Markdown is nearly a worst-case scenario for injections because it uses one parser for block-level language features (paragraphs, lists, blockquotes, headings), then a different one inside each block-level element to handle all inline features (hyperlinks, bold, italics, images). I stress-tested by taking one of my Tree-sitter blog posts and pasting it into the editor again and again until I had a file about 8,000 lines long and with ~4400 injections.

I then simulated the addition/changing of an existing grammar by calling `languageMode.updateForInjection` and passing the grammar in question. This triggered the existing behavior: re-invoking `_populateInjections` for each language layer in the entire buffer.

Testing against `master`, I found that this froze the editor for about _18 seconds_. (Now imagine this happening over and over for a _number_ of grammars during startup!)

### How injections are evaluated

The default state of an injection is quite stable once it’s created. Most buffer changes don’t affect most injections, and typically a given layer only needs a reparse if a change happens inside of it.

To figure out which layers are implicated by a buffer change, we ask the injection marker layer for all markers that intersect the given range. We then filter that range manually to exclude all markers except those that have a `parentLanguageLayer` property equal to the current `LanguageLayer`.

The first half of that — the `MarkerLayer::findMarkers` call — is surprisingly fast, even when there are thousands of injections. The second half is fine when only a few markers are returned, but utter pain when _lots_ of markers are returned.

The rest of the work done in `_populateInjections` is itself somewhat costly, but that’s not the problem; if it needs to happen, it needs to happen. But there’s an opportunity to realize large performance gains here by skipping any _unnecessary_ work.

### The fixes

Hence what I needed to do was

1. reduce the cost for a `LanguageLayer` to retrieve the markers for any child `LanguageLayer`s it created, and
2. find ways to avoid re-assessing injections on layers that are not affected by the addition of a specific grammar.

Point 1 was pretty easy. There’s no particular reason I can think of that we weren’t already storing a `LanguageLayer`’s child layer markers in a data structure on the `LanguageLayer` itself. I copied this approach from the legacy `TreeSitterLanguageMode` class. Maybe the idea was that the marker layer should be the only source of truth.

But ultimately it’s folly for a hierarchical data structure to need to iterate through a flat list to derive its hierarchy. So we’ll define a `childLayerMarkers` set on each `LanguageLayer` and will add/remove layers from it as needed.

Point 2 is much harder. As one of the comments I added explains, several things cause us to need to re-assess injections:

* buffer changes, obviously;
* changes to injection points (e.g., if a newly activated package adds another possible injection to our hypothetical `source.gfm` Markdown grammar);
* changes to our list of grammars (since the new grammar might be a candidate for injecting into children).

We’ve already got the first one sorted.

The second one, perhaps surprisingly, does _not_ trigger automatic re-highlighting, but we can add that in later once this is demonstrated to work. Not hard.

The third one is the hardest of the three, but still doable. Injections describe the language they want to inject with an informal string like `html` or `c` rather than an exact scope ID. So if we want to assess whether a newly added grammar necessitates a re-evaluation of a layer’s injections, we can look at all the language strings it tried to turn into grammars last time around — both the successes and the failures.

Define _SLS_ as all the unique language strings associated with the layer’s current child injection layers. Define _ULS_ as all the unique language strings this layer has ever considered that _would’ve_ resulted in injection layers, but failed because they didn’t match any known grammars. These represent “successful” and “unsuccessful” language strings, respectively.

To determine whether the addition of a grammar _G_ requires the re-evaluation of a layer’s injections, we take the following steps:

* if the new grammar does not define an `injectionRegex`, it cannot be used in injections by definition; we can ignore it on all layers;
* otherwise, if the set of injection points added to the current layer’s grammar has changed since we last evaluated injections, we _must_ re-evaluate the layer’s injections (temporary); 
* otherwise, check each item in _SLS_ and see if `atom.grammars.treeSitterGrammarForLanguageString(item)` matches _G_; if so, re-evaluate the layer;
* otherwise, check each item in _ULS_ and see if `atom.grammars.treeSitterGrammarForLanguageString(item)` matches _G_; if so, re-evaluate the layer;
* otherwise, we can skip calling `_populateInjections` on the layer.

Even in the final lucky case, we must assess each of this layer’s child layers. On `master`, we visit each layer linearly in a loop after calling `getAllLanguageLayers()`; on this PR, because we now keep track of parent/child layer relationships explicitly, we traverse them in a tree-like depth-first order. Either is fine.

Note the item I marked as “temporary” in the list above. When we actually force re-evaluation when injection points are added, we can skip this step. But until that happens, we must consider it when determining if we can take shortcuts, or else we’ll end up with false positives.

### The results

On my sample Markdown file, I tested calling `updateForInjection` with three different grammars:

* a grammar (YAML) that had no presence in the file — this should be very cheap to add;
* a grammar (hyperlink) that had some presence in the file — about 1,400 paragraphs in the sample file had hyperlinks and thus would’ve required a small percentage of injections to be re-evaluated;
* a grammar (markdown-inline) that had a _huge_ presence in the file (~4,000 layers, though requiring only the root language layer to repopulate its injections) and would’ve triggered a large amount of churn.

I benchmarked each by running

```js
console.time('x');
languageMode.updateForInjection(someGrammar);
console.timeEnd('x');
```

On `master`, before any of these changes, such a call took about 18,000ms — eighteen seconds of a frozen editor — no matter which grammar I used.

Taking them in order, here are the results after these changes:

* Calling `updateForInjection` with the YAML grammar cost us ~8ms — a triumph, as it still required iteration through ~4,400 injection layers.
* Calling it with the hyperlink grammar cost us ~514ms as we called `_populateInjections` on ~1,400 layers.
* Calling it with the markdown-inline grammar called `_populateInjections` on one layer and cost us ~800ms.

The last case is worth thinking about. You might well wonder if this is an accurate simulation if these grammars have already been activated and thus can use existing trees. I tested this just in case by _removing_ the injection point in question (the one that injects markdown-inline into markdown-block), closing the file, opening the file (so that I had a file devoid of inline Markdown styling), re-adding adding the injection point, and finally calling `updateForInjection` with the markdown-inline grammar.

These steps didn’t substantially change the benchmark — because the actual _parsing_ is already asynchronous! It took some number of seconds for the inline syntax highlighting to appear, but we weren’t blocked on that work; it yielded the main thread at regular intervals to keep the UI snappy.

That’s what this is about: cutting down on unnecessarily _synchronous_ work. I could spend a lot of time on this and cut it down further — and I may! — but diminishing returns will kick in at some point.

---

Anyway, this is the longest commit message I’ve ever written. It will be supplemented with tests in future commits, but at least now future generations will understand why all this was done. Hopefully they’ll understand why I spent almost as much time writing this message as doing the actual work.
Threatened to do this in The Commit Message That Would Not End; now it’s happening.
…when gathering existing injection markers to re-evaluate.

This was silly of me. When we repopulate injections, we cast as wide a net as possible to make sure we've caught all the existing child `LanguageLayer`s in the area affected by the buffer change. We do this for two reasons:

1. Often we can skip creating a new `LanguageLayer` if the old one is still valid and covers the range we expect.
2. The ones that _don't_ get matched up and reused need to be destroyed!

I wasn't thinking about the second part. I assumed that `MarkerLayer#findMarkers` wouldn't return invalid markers, but it does; and it's a good thing, too, because markers can go from invalid to valid (e.g., when the user navigates backward through the history stack).

Prematurely winnowing the invalid markers out of the set meant that some `LanguageLayer`s did not get destroyed when they should’ve been, leading to some redundant highlighting on inline Markdown code as I was writing it.

I tracked it down by re-introducing the old logic as a sort of heckler from the wings; whenever the new logic produced results inconsistent with what was there before, I had it log to the console and call me a fraud. It worked!
@savetheclocktower savetheclocktower marked this pull request as ready for review April 14, 2025 16:47
@savetheclocktower
Copy link
Copy Markdown
Contributor Author

Release day nears, so I suppose it's time to take this out of draft!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice to have options for this.

(Recapping so this doesn't have zero context in the review overview: Comments in JSON? Handy to have, if supported. Also ILLEGAL and IIRC some software will error/bail upon encountering them. jsonc? Strange but real, apparently.)

(Posting individual comments as I go so it feels like I am making progress, feel free to delay looking at any of these until I'm further along/look at these comments in batches or all at once at the end if you want.)

Comment thread packages/language-json/package.json Outdated
"allowCommentsInJsonFiles": {
"type": "boolean",
"title": "Allow Comments in .json files",
"description": "When checked, will allow comment syntax in all files with the `.json` extension even though it’s not valid JSON. When unchecked, comments in `.json` files will be marked as invalid, and only files with the `.jsonc` extension will be allowed to show comments. (It’s recommended to reload your window or restart Pulsar after changing this setting.)",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if a less-technical-user friendlier explanation is possible here?

This is primarily for things like highlighting, code folding, etc, right?

So I might rephrase some of it?

Suggested change
"description": "When checked, will allow comment syntax in all files with the `.json` extension even though it’s not valid JSON. When unchecked, comments in `.json` files will be marked as invalid, and only files with the `.jsonc` extension will be allowed to show comments. (It’s recommended to reload your window or restart Pulsar after changing this setting.)",
"description": "When checked, will allow comments in `.json` files, even though it’s not valid according to the official JSON specification. When unchecked, comments in `.json` files will be marked as invalid, and only files with the `.jsonc` extension will mark and highlight comments as normal/allowed. (It’s recommended to reload your window or restart Pulsar after changing this setting.)",

If this is off-base, then feel free to not change the wording, suggestion was to give an example of how I thought it might be made more approachable, but I am not attached to the specific phrasing in the code suggestion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine, and I'll tweak it only slightly from what you've suggested:

When checked, will allow comments in .json files even though they’re not valid according to the official JSON specification. When unchecked, comments in .json files will be marked as invalid, and only files with the .jsonc extension will mark comments as valid and allowed. (It’s recommended to reload your window or restart Pulsar after changing this setting.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I've made this change locally, so it'll get addressed once I address any further feedback you have.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have these changes ready to go? I'll probably end up testing the bins and tap out of reading the big commit's message and diff. And everything else seems fine to my eyes. So, I probably won't have much further or detailed feedback beyond this point.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed it.

@DeeDeeG
Copy link
Copy Markdown
Member

DeeDeeG commented Apr 14, 2025

Status: I am reading the quite large commit message and its associated diff, taking breaks for food, etc. It may be a bit before I post more.

@savetheclocktower
Copy link
Copy Markdown
Contributor Author

Status: I am reading the quite large commit message and its associated diff, taking breaks for food, etc. It may be a bit before I post more.

Fire a signal flare every 12 hours so I know you're still alive.

…by providing a fallback value for the second argument of `seek`.

Atom IDE has a very creative strategy for doing headless imperative syntax highlighting: it creates a `TextBuffer`, sets the correct grammar and language mode, waits for the contents to parse, then calls `buildHighlightIterator` and seeks manually from point A to point B. You can see this in action in `atom-ide-datatip` and `atom-ide-signature-help`.

But this is an old strategy that assumes we're always working with `TextMateHighlightIterator`s — which need to be told where to start but not where to end.

The `HighlightIterator` used in `WASMTreeSitterLanguageMode` has a second parameter to `seek` that specifies where our highlighting job will end. If this parameter is omitted, syntax highlighting fails.

So we'll recover in these cases by setting our own end row. All of our own usages of `HighlightIterator::seek` supply two arguments, so this is just about supporting a very unusual strategy used by old code.
@DeeDeeG
Copy link
Copy Markdown
Member

DeeDeeG commented Apr 30, 2025

The big slowdown seems to have gotten much faster to resolve (and happens much more interactively rather than locking up the window's whole renderer thread for one huge continuous stretch) when launching Pulsar with a large MarkDown document open, or when activating a language package with such a document open. Assuming it is being parsed as MarkDown and not set to Plain Text which skips the problem entirely.

👍

But yeah, I feel that's a decent way to report some meaningful feedback/review of the big ol' commit that was.

@savetheclocktower
Copy link
Copy Markdown
Contributor Author

The big slowdown seems to have gotten much faster to resolve (and happens much more interactively rather than locking up the main thread) when launching Pulsar with a large MarkDown document open, or when activating a language package with such a document open. Assuming it is being parsed as MarkDown and not set to Plain Text which skips the problem entirely.

I'm thrilled you still have the original conditions available so you can verify this fix.

@DeeDeeG
Copy link
Copy Markdown
Member

DeeDeeG commented Apr 30, 2025

Oh yeah, just booted up Pulsar 1.127.1 again (without this PR's fixes) and it is doing the full-on slowdown again, so at least from very brief testing (two boots, disabling/re-enabling a language package once) this PR is making a very big difference.

Copy link
Copy Markdown
Member

@DeeDeeG DeeDeeG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From some brief testing:

  • App launches and generally works
  • Highlighting and folding in various languages seems to be working well (tested JavaScript, YAML, MarkDown, JSON)
  • Performance in general use feels normal/good

In addition to the stuff I've looked at and commented about above in the thread, this is looking good to me! Thanks again for all the work.

@savetheclocktower
Copy link
Copy Markdown
Contributor Author

@DeeDeeG hollered uncle (and fairly so) on having to read that long commit message. Just to make life easier for our future selves, here's a summary of those changes that spares you from my long-windedness:

  • When you re-open a project, you've likely got some files open already. It's hard to restore syntax highlighting on those files while grammars are still being activated… but it's something we have to do anyway, since new grammars can be activated at any time.
  • Suppose you're reopening a project with a Markdown file open. We'll syntax-highlight the file as soon as the Markdown grammar is activated. In the general case, there's no good reason to wait longer than that.
  • But you might have lots of fenced code blocks in your Markdown file, and those are handled as injections. The first highlighting pass might've happened before some of those grammars were ready. So as other grammars activate themselves, we might need to re-highlight the file to reflect our new capabilities.
  • On current master we're using an extremely blunt tactic: re-process all injections whenever almost any grammar is activated — it just had to be a modern Tree-sitter grammar and be injectable. That's why startup can be so slow for certain kinds of documents — to the point of long freezes.
  • On this PR, we're trying very, very hard to skip doing that extra work. Even a bunch of bookkeeping to determine whether that work needs to be done… is faster than doing that work needlessly.
  • The new approach is to visit each language layer in turn (with an optimized data structure that helps each layer more easily keep track of its child injection layers) and see if the new grammar would've been used instead of the grammar we used last time. We also keep track of the previous update's failed attempts to find grammars (often because they weren't activated yet!) and see if this new grammar would've been used for any of those. If so, we re-parse and re-highlight the layer; if not, we skip it.
  • This makes us much smarter about when to re-parse and re-highlight in response to grammar activations on startup. A freeze of ten or more seconds (enough to make the “we think we're frozen” dialog appear) was, in practice, reduced to under 1 second in the worst case.

@savetheclocktower savetheclocktower merged commit 1375f7d into master Apr 30, 2025
103 checks passed
@savetheclocktower savetheclocktower deleted the tree-sitter-1-128 branch April 30, 2025 20:05
savetheclocktower added a commit that referenced this pull request May 3, 2025
* [language-json] Add optional comment support…

…plus an extra “JSON With Comments” grammar that explicitly allows them and is used for `.jsonc` files.

Comments are not allowed in the JSON spec, so not all tooling supports parsing JSON wiht comments. But that limits their usefulness as configuration files. VS Code has popularized JSONC (a flavor of JSON where both line and block comments are expressly permitted) as a config file format.

Currently, we do _nothing_ around comments in JSON files. They are highlighted like plain text. But `tree-sitter-json` tolerates and parses comments!

Because of this, we can support JSONC for free. Hence a new grammar that uses the exact same parser and query files as the current JSON grammar, but where comments are legal.

We can also handle them better in ordinary JSON files: by marking them with the `invalid.illegal` scope, much like we already do with trailing commas.

Some tooling even blurs lines a bit by adding comments to `.json` files. (`tsconfig.json`, from a _different_ Microsoft team, is perhaps the most notable offender.) So we’re also offering a way for users to opt into a mode that tolerates comments. A new “Allow Comments in .json files” setting on the package will stop marking comments as illegal in JSON.

I think this strikes a good balance between being correct and being realistic to the real world.

* [language-(java|type)script] Fix overeager “invalid” highlighting…

…when Tree-sitter pretends to see an “optional chain” node in the tree.

Tree-sitter has to make the best of invalid syntax all the time because it reparses even while people are typing. When there are invalidities, it tries a number of strategies to get back to a normal parsing mode as quickly as possible.

The strategies fall into two buckets: (a) pretending that an invalid token isn't there, and (b) pretending something that _ought_ to be there actually _is_. If the latter strategy gets it to a valid tree faster, that's what it'll do.

`tree-sitter-javascript` parses `optional_chain` nodes (the “?.” in `foo?.bar`) in more places than they’re actually valid. They’re not allowed on the left-hand side of an assignment, so when we see one in that context, we add an `invalid.illegal` scope.

But sometimes the `optional_chain` node we see is not actually there because of Tree-sitter’s error-recovery behavior. So we add an `invalid.illegal` scope at a given position, expecting that we'll close it later… but we never do, since the closing scope boundary is at the same spot, and we close scopes before we open them.

There's a larger bug fix that needs to happen to address all _that_, but the simplest fix for now is to enforce the content of the `optional_chain` node so that we don’t try to add a scope to an empty node.

* Don't store any scope boundaries if the capture range is empty

This is the “proper” fix for the issue I described in an earlier commit in which a zero-width capture could “leak” a scope name into everything that comes after it.

Those captures may still be valid for other uses. For instance, indentation resolution uses `ScopeResolver#store` to apply the logic of custom predicates, but does not need to consume the boundary data; it is happy to tolerate empty captures if the grammar author opts into it. That should be unchanged; we just won't add any boundary data that we know will produce invalidity.

This will do the right thing with regard to the syntax highlighting _and_ the “Editor: Log Cursor Scope” command.

* Optimize reevaluation of injections when grammars are added or changed

Long explanation follows. (Long even by my standards.)

When a new grammar is activated — as happens often during window startup — we do something extremely disruptive to the buffer: we take all language layers and make them re-evaluate all their injections across their entire ranges.

Ordinarily, this happens on every keystroke, but the scope of the re-evaluation is limited to the range invalidated by the most recent edit. (In Markdown, if I type a letter inside of a paragraph, the only injection re-evaluated will be that particular paragraph; we can rule out that any other injections are implicated by the change.)

But when a new grammar is added, we need to reconsider any injections that were previously attempted but failed. (If we try to inject the `hyperlink` grammar into Markdown paragraphs, but we’re still starting up and said grammar isn’t active yet, the attempt will fail. But when the `hyperlink` grammar, or any other grammar, is activated, we’ll try again.)

This late activation of a grammar (or sometimes change of an existing grammar) is unpleasant because it makes us throw out some work and try again. One of the main upsides of Tree-sitter is that only the _initial_ parse has to handle the entire buffer; but when we have to redo injection work, it forces a reparse of all layers except the root language layer.

* All Tree-sitter buffers consist of a root `LanguageLayer` that uses the buffer’s base grammar to highlight the entire file.
* If it needs to inject other types of content (HTML needs `script` blocks to contain JS, JS needs `RegExp` literals to contain regex literal highlighting), it creates new child `LanguageLayer`s for specific ranges.
* The language that’s injected can itself inject into sub-ranges of its own layer. Hence an HTML file can inject a JS grammar, which in turn can inject a regex grammar.
* To add an injection to (e.g.) JavaScript, I’d call `atom.grammars.addInjectionPoint('source.js', foo)`, where `foo` is an object that describes
  * which named Tree-sitter nodes to inject into,
  * which language to inject into each one (identified by strings that are matched against grammars in fuzzy pattern), and
  * other options that are unimportant right now.
* Finally, a quick refresher on markers: they track logical ranges of a buffer in a way that tolerates shifts in any direction (because of edits above or below) and even enlargment and shrinking (because of edits within). We use them for many things, but in this case we use them to track logical ranges of injections.
  * A `LanguageLayer` can inject into any number of discrete ranges with gaps in between, but each `LanguageLayer` has a marker that tracks the union of all those ranges and caps via a marker.

When a grammar is activated, it can have a catastrophic effect on performance. Re-evaluating each layer in an existing buffer can take a long time; and if the grammar activation order is especially unlucky, the cost must be paid over and over again, as @DeeDeeG found out.

Markdown is nearly a worst-case scenario for injections because it uses one parser for block-level language features (paragraphs, lists, blockquotes, headings), then a different one inside each block-level element to handle all inline features (hyperlinks, bold, italics, images). I stress-tested by taking one of my Tree-sitter blog posts and pasting it into the editor again and again until I had a file about 8,000 lines long and with ~4400 injections.

I then simulated the addition/changing of an existing grammar by calling `languageMode.updateForInjection` and passing the grammar in question. This triggered the existing behavior: re-invoking `_populateInjections` for each language layer in the entire buffer.

Testing against `master`, I found that this froze the editor for about _18 seconds_. (Now imagine this happening over and over for a _number_ of grammars during startup!)

The default state of an injection is quite stable once it’s created. Most buffer changes don’t affect most injections, and typically a given layer only needs a reparse if a change happens inside of it.

To figure out which layers are implicated by a buffer change, we ask the injection marker layer for all markers that intersect the given range. We then filter that range manually to exclude all markers except those that have a `parentLanguageLayer` property equal to the current `LanguageLayer`.

The first half of that — the `MarkerLayer::findMarkers` call — is surprisingly fast, even when there are thousands of injections. The second half is fine when only a few markers are returned, but utter pain when _lots_ of markers are returned.

The rest of the work done in `_populateInjections` is itself somewhat costly, but that’s not the problem; if it needs to happen, it needs to happen. But there’s an opportunity to realize large performance gains here by skipping any _unnecessary_ work.

Hence what I needed to do was

1. reduce the cost for a `LanguageLayer` to retrieve the markers for any child `LanguageLayer`s it created, and
2. find ways to avoid re-assessing injections on layers that are not affected by the addition of a specific grammar.

Point 1 was pretty easy. There’s no particular reason I can think of that we weren’t already storing a `LanguageLayer`’s child layer markers in a data structure on the `LanguageLayer` itself. I copied this approach from the legacy `TreeSitterLanguageMode` class. Maybe the idea was that the marker layer should be the only source of truth.

But ultimately it’s folly for a hierarchical data structure to need to iterate through a flat list to derive its hierarchy. So we’ll define a `childLayerMarkers` set on each `LanguageLayer` and will add/remove layers from it as needed.

Point 2 is much harder. As one of the comments I added explains, several things cause us to need to re-assess injections:

* buffer changes, obviously;
* changes to injection points (e.g., if a newly activated package adds another possible injection to our hypothetical `source.gfm` Markdown grammar);
* changes to our list of grammars (since the new grammar might be a candidate for injecting into children).

We’ve already got the first one sorted.

The second one, perhaps surprisingly, does _not_ trigger automatic re-highlighting, but we can add that in later once this is demonstrated to work. Not hard.

The third one is the hardest of the three, but still doable. Injections describe the language they want to inject with an informal string like `html` or `c` rather than an exact scope ID. So if we want to assess whether a newly added grammar necessitates a re-evaluation of a layer’s injections, we can look at all the language strings it tried to turn into grammars last time around — both the successes and the failures.

Define _SLS_ as all the unique language strings associated with the layer’s current child injection layers. Define _ULS_ as all the unique language strings this layer has ever considered that _would’ve_ resulted in injection layers, but failed because they didn’t match any known grammars. These represent “successful” and “unsuccessful” language strings, respectively.

To determine whether the addition of a grammar _G_ requires the re-evaluation of a layer’s injections, we take the following steps:

* if the new grammar does not define an `injectionRegex`, it cannot be used in injections by definition; we can ignore it on all layers;
* otherwise, if the set of injection points added to the current layer’s grammar has changed since we last evaluated injections, we _must_ re-evaluate the layer’s injections (temporary);
* otherwise, check each item in _SLS_ and see if `atom.grammars.treeSitterGrammarForLanguageString(item)` matches _G_; if so, re-evaluate the layer;
* otherwise, check each item in _ULS_ and see if `atom.grammars.treeSitterGrammarForLanguageString(item)` matches _G_; if so, re-evaluate the layer;
* otherwise, we can skip calling `_populateInjections` on the layer.

Even in the final lucky case, we must assess each of this layer’s child layers. On `master`, we visit each layer linearly in a loop after calling `getAllLanguageLayers()`; on this PR, because we now keep track of parent/child layer relationships explicitly, we traverse them in a tree-like depth-first order. Either is fine.

Note the item I marked as “temporary” in the list above. When we actually force re-evaluation when injection points are added, we can skip this step. But until that happens, we must consider it when determining if we can take shortcuts, or else we’ll end up with false positives.

On my sample Markdown file, I tested calling `updateForInjection` with three different grammars:

* a grammar (YAML) that had no presence in the file — this should be very cheap to add;
* a grammar (hyperlink) that had some presence in the file — about 1,400 paragraphs in the sample file had hyperlinks and thus would’ve required a small percentage of injections to be re-evaluated;
* a grammar (markdown-inline) that had a _huge_ presence in the file (~4,000 layers, though requiring only the root language layer to repopulate its injections) and would’ve triggered a large amount of churn.

I benchmarked each by running

```js
console.time('x');
languageMode.updateForInjection(someGrammar);
console.timeEnd('x');
```

On `master`, before any of these changes, such a call took about 18,000ms — eighteen seconds of a frozen editor — no matter which grammar I used.

Taking them in order, here are the results after these changes:

* Calling `updateForInjection` with the YAML grammar cost us ~8ms — a triumph, as it still required iteration through ~4,400 injection layers.
* Calling it with the hyperlink grammar cost us ~514ms as we called `_populateInjections` on ~1,400 layers.
* Calling it with the markdown-inline grammar called `_populateInjections` on one layer and cost us ~800ms.

The last case is worth thinking about. You might well wonder if this is an accurate simulation if these grammars have already been activated and thus can use existing trees. I tested this just in case by _removing_ the injection point in question (the one that injects markdown-inline into markdown-block), closing the file, opening the file (so that I had a file devoid of inline Markdown styling), re-adding adding the injection point, and finally calling `updateForInjection` with the markdown-inline grammar.

These steps didn’t substantially change the benchmark — because the actual _parsing_ is already asynchronous! It took some number of seconds for the inline syntax highlighting to appear, but we weren’t blocked on that work; it yielded the main thread at regular intervals to keep the UI snappy.

That’s what this is about: cutting down on unnecessarily _synchronous_ work. I could spend a lot of time on this and cut it down further — and I may! — but diminishing returns will kick in at some point.

---

Anyway, this is the longest commit message I’ve ever written. It will be supplemented with tests in future commits, but at least now future generations will understand why all this was done. Hopefully they’ll understand why I spent almost as much time writing this message as doing the actual work.

* (Forgot to rename this method)

* Re-populate injections in existing buffers when injection points change

Threatened to do this in The Commit Message That Would Not End; now it’s happening.

* Add specs

* Don't test for marker validity…

…when gathering existing injection markers to re-evaluate.

This was silly of me. When we repopulate injections, we cast as wide a net as possible to make sure we've caught all the existing child `LanguageLayer`s in the area affected by the buffer change. We do this for two reasons:

1. Often we can skip creating a new `LanguageLayer` if the old one is still valid and covers the range we expect.
2. The ones that _don't_ get matched up and reused need to be destroyed!

I wasn't thinking about the second part. I assumed that `MarkerLayer#findMarkers` wouldn't return invalid markers, but it does; and it's a good thing, too, because markers can go from invalid to valid (e.g., when the user navigates backward through the history stack).

Prematurely winnowing the invalid markers out of the set meant that some `LanguageLayer`s did not get destroyed when they should’ve been, leading to some redundant highlighting on inline Markdown code as I was writing it.

I tracked it down by re-introducing the old logic as a sort of heckler from the wings; whenever the new logic produced results inconsistent with what was there before, I had it log to the console and call me a fraud. It worked!

* Fix issue with Atom IDE usage of `HighlightIterator`…

…by providing a fallback value for the second argument of `seek`.

Atom IDE has a very creative strategy for doing headless imperative syntax highlighting: it creates a `TextBuffer`, sets the correct grammar and language mode, waits for the contents to parse, then calls `buildHighlightIterator` and seeks manually from point A to point B. You can see this in action in `atom-ide-datatip` and `atom-ide-signature-help`.

But this is an old strategy that assumes we're always working with `TextMateHighlightIterator`s — which need to be told where to start but not where to end.

The `HighlightIterator` used in `WASMTreeSitterLanguageMode` has a second parameter to `seek` that specifies where our highlighting job will end. If this parameter is omitted, syntax highlighting fails.

So we'll recover in these cases by setting our own end row. All of our own usages of `HighlightIterator::seek` supply two arguments, so this is just about supporting a very unusual strategy used by old code.

* Address feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants