Evaluate Usage of Atom's Highlights #14
**How is the language support?** Excellent. It seems like highlights itself supports language definitions either as TextMate grammars or as CommonJS packages with grammar files (cson/json).

**How is the styling support?** Good. But most themes for Atom are using LESS. The syntax highlighting definitions look like this: index.less.

**Does it support hooks?** Probably. The main highlighter class is quite short (<100 lines); see highlights.coffee. Similar code would have to be written to create a transform method which can be used by grock to highlight vinyl file objects. It may be possible to detect and extract comments in code by adding a check in the loop that renders tokens in scopes, as a comment would be a specific scope. That way, the beginning of a comment could be used to start a new file segment (each segment contains arrays of comments and code lines). E.g., in the Lua grammar, the scopes …
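The scope check described above could be sketched like this. This is an illustrative snippet, not code from highlights; the `{value, scopes}` token shape matches what first-mate produces, and `isCommentToken` is a hypothetical helper name:

```javascript
// A token as produced by first-mate: a text value plus a stack of scope names,
// e.g. { value: '-- hi', scopes: ['source.lua', 'comment.line.double-dash.lua'] }.
// A token is part of a comment if any scope in its stack is rooted at "comment".
function isCommentToken(token) {
  return token.scopes.some(function (scope) {
    return scope.split('.')[0] === 'comment';
  });
}

// Illustrative token shapes (not real grammar output):
isCommentToken({ value: '-- note', scopes: ['source.lua', 'comment.line.double-dash.lua'] }); // true
isCommentToken({ value: 'print(',  scopes: ['source.lua'] });                                 // false
```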
I just implemented code splitting and highlighting with highlights. It's still quite rough around the edges, but it renders grock's own documentation (js/coffee/md, thin style). Code is in the feature/14-highlights branch.
Benchmark: Rendering grock's Documentation
Considering grock takes approx. 1.1s to load on my machine, highlights seems to need at least 70% more time (times are averages of ~20 runs each). Also worth noting: the output size of the complete docs directory (incl. assets) is 339kB (84kB as zip) for highlight.js and 1.2MB (108kB as zip) for highlights. The latter seems to put everything it knows into the markup, producing truckloads of nested span tags.
While fiddling around with the code on this branch, I ran some tests to get some numbers. Like @killercup, my tests are based on rendering Grock's documentation. I used the following command to generate the docs:

```sh
./bin/grock --style=thin
```

Running the command outputs info about the build. We're only interested in the amount of time it took to generate, so I used the following awk snippet to extract the number of milliseconds from Grock's output:

```sh
awk '/Done. Generated in/ { print $5 * ($6 == "s" ? 1000 : 1) }'
```

I wanted to be able to compare my results with those @killercup derived, so I've also averaged 20 iterations. Here's the full code I used for the tests:

```sh
cat /dev/null > /tmp/grock.out.txt
for i in $(seq 1 20); do
  ./bin/grock --style=thin | awk '/Done. Generated in/ { print $5 * ($6 == "s" ? 1000 : 1) }' | tee -a /tmp/grock.out.txt
done
awk 'NR == 1 { max=$1; min=$1; sum=0 }
{ if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1 }
END { printf "Min: %d ms.\tMax: %d ms.\tAverage: %.1f ms.\n", min, max, sum/NR }' /tmp/grock.out.txt
rm -f /tmp/grock.out.txt
```

NOTE: The Min/Max/Avg awk is from this answer to "Is there a way to get the min, max, median, and average of a list of numbers in a single command?" on Unix & Linux Stack Exchange.

Running the script on the …
On @killercup's machine this value was 1.90s, or 1900ms, giving a k value between the two machines of: …

My next test was on the …

If we multiply 1065.0ms by the k value from above, we get 2021ms, which is only 81% of @killercup's measurement of 2490ms. I would have assumed the values would be closer, so I can only speculate as to why there is such a big discrepancy (my tests write to an SSD, and atom/highlights generates 3-4x more data, so it may be due to file I/O). Continuing on, I stripped highlight.js out of the branch using the following steps: …
Running the tests again gives: …

This gives the fastest generation times yet. For one last test, I upgraded the atom/highlights package in …
This yields the slowest generation times yet. I wonder what's happened to atom/highlights to make the latest version 175ms slower? I'll run some more tests. It's probably worth trying a few of the more recent versions of atom/highlights to see if they're all as slow as version 1.2.0, and whether there's been a trend over time. I'd also like to see how hard it would be to modify atom/highlights so the renderer didn't output a sea of superfluous span tags. If atom/highlights ends up being peppy enough, and some of its output tags can be trimmed back, then it might be worth re-opening this issue.
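One way to trim the superfluous span tags might be to merge adjacent tokens that carry identical scope stacks before rendering. A minimal sketch of that idea follows; this is a hypothetical helper, not actual atom/highlights code:

```javascript
// Merge adjacent tokens whose scope stacks are identical, so a renderer can
// emit one <span> for the run instead of several repeated ones.
function mergeAdjacentTokens(tokens) {
  return tokens.reduce(function (merged, token) {
    var last = merged[merged.length - 1];
    if (last && last.scopes.join(' ') === token.scopes.join(' ')) {
      last.value += token.value; // same styling applies: concatenate the text
    } else {
      merged.push({ value: token.value, scopes: token.scopes.slice() });
    }
    return merged;
  }, []);
}
```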
@jonruttan, thank you very much for your wonderful investigation! If using highlights increases the compile time by 20% but makes grock's code much cleaner and easier to extend (e.g. by having the option to include custom language specifications from Atom), we should use it. Far more concerning to me is the amount of output highlights generates. (What follows is speculation. I may be completely wrong. I haven't looked at the code in about a year.) I noticed that grock runs two transforms that each tokenize the file: the first to split code from comments (as its name suggests), the second to highlight the code. Ideally, a file should only be tokenized once, then split into segments, and then the sub-trees of tokens should be rendered. The biggest problem with this will be that breaking code at seemingly arbitrary points can lead to invalid highlighting. Do you think it might be possible to use highlights in such a way that we can insert split points that preserve the scope the code is currently in? Essentially, I'm talking about rewriting this part of highlights' …
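The tokenize-once-then-split idea could look roughly like this. It is a sketch with hypothetical names; segments follow the shape described earlier in the thread, with arrays of comment and code token lines:

```javascript
// Split tokenized lines (one array of {value, scopes} tokens per line) into
// segments, starting a new segment whenever a comment line begins after code.
// Tokenization happens once, upstream; this step only regroups the token lines.
function splitSegments(tokenizedLines) {
  var segments = [{ comments: [], code: [] }];
  tokenizedLines.forEach(function (lineTokens) {
    var isComment = lineTokens.some(function (token) {
      return token.scopes.some(function (scope) {
        return scope.split('.')[0] === 'comment';
      });
    });
    var current = segments[segments.length - 1];
    if (isComment && current.code.length > 0) {
      // A comment following code starts a new segment.
      current = { comments: [], code: [] };
      segments.push(current);
    }
    current[isComment ? 'comments' : 'code'].push(lineTokens);
  });
  return segments;
}
```

Because the input is the already-tokenized line stream, each segment keeps the token sub-trees intact and can be rendered without re-tokenizing.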
It may also be interesting to look more deeply at the implementation of first-mate (which provides the grammar registry and parsing highlights uses). Skipping highlights and using first-mate directly might also allow us to skip the dependencies on season, fs-plus and underscore-plus.
You're welcome @killercup, the tests are not a problem. Thanks for maintaining the project. With this extra info I dug a little further into the code in the …

I looked into your question about using highlights in such a way that we can insert split points that preserve the scope the code is currently in, and I don't think it would be too hard. I also did a little work on reducing the size of the generated documentation. As is, the generated documentation is 1236KiB. By removing the dispensable span tag being added on Line #114 of transforms/splitCodeAndComments, it shrank to 1052KiB. With a quick hack to atom/highlights' …
That's great news, @jonruttan! Are you going to continue working on this? I would love to see a pull request with your changes! (Even work-in-progress; you can just make it merge to the …
Are the 339kB for highlight.js I quoted above still valid? That would mean the newly generated code is about 3 times larger. We should recommend using a compressed file system in the Readme ;)

Edit: I totally forgot to reopen the issue earlier.
I'll definitely continue working on this – there appear to be many upsides to using atom/highlights, and after poking my nose into the code yesterday I'm sure we have a few avenues for remedying the size/performance issues it introduces.
Sure thing. If I can get what I'm working on finished up I'll push these changes out in the next few hours.
No, it's grown a bit to 476KiB, so the newly generated code is a little over 2x larger.
lol, that's one way of dealing with it.
Hi @killercup. I've made some progress on this front, but I wasn't comfortable pushing my changes to the repo because the hacks I'd made to the code broke too many tests. Last week I spent some more time analysing the code, and in the end I decided to forego atom/highlights and use atom/first-mate instead. The lack of separation of concerns between the Lexer and the Renderer sections within the …

There is some handy helper code in the Highlights class which loads the TextMate grammar files. Rather than use first-mate directly, I'm breaking that helper code out into a Lexer class. Like first-mate, the Lexer's main function will take text as input, but it will just return the tokenized lines instead of rendering them. Complementary to the Lexer, the Renderer's job will be to take tokenized lines as input and return the rendered output. Between the Lexer and the Renderer, we can split the tokenized lines into two streams, further parse the comments, then render each — or at least that's the plan… I'll keep you posted.
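A rough sketch of the Renderer half of that split might look like the following. It is illustrative only: it emits flat spans rather than highlights' nested markup, and it skips HTML escaping:

```javascript
// Sketch of a Renderer that takes tokenized lines (as a Lexer would return
// them) and produces HTML. Scope names like "source.r" become class lists
// like "source r", mirroring the markup style highlights produces.
function renderHtml(tokenizedLines) {
  var lines = tokenizedLines.map(function (lineTokens) {
    var spans = lineTokens.map(function (token) {
      var classes = token.scopes.map(function (s) { return s.replace(/\./g, ' '); }).join(' ');
      return '<span class="' + classes + '">' + token.value + '</span>';
    }).join('');
    return '<div class="line">' + spans + '</div>';
  });
  return '<pre class="editor editor-colors">' + lines.join('') + '</pre>';
}
```

The point of the split is that this function never touches a grammar: any token stream, from any Lexer, renders the same way.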
Thanks for the update, @jonruttan. That sounds like a great plan so far. Don't worry about breaking tests, by the way. A lot of them are just testing the current transforms and you can just remove them (or replace them with equivalent ones for your new transforms). |
Hi @killercup, I got the Lexer class done and published to npm. It's at https://github.com/jonruttan/textlex. No Grock docs for it yet though 😉 I'll get started on the renderer next. The work for it is pretty much the same as what was needed to get the Lexer published, so it shouldn't take too long. I haven't looked into all of the details yet, but now that we've got the lexer, we can probably use it to do the Markdown parsing as well. My textlex package has a CLI interface, so you can use it with some Markdown files if you'd like to see some examples of the tokens it outputs.
@jonruttan, sorry it took me two weeks to respond. Your textlex looks quite nice! (And it reminded me that I wanted to compile grock's CoffeeScript sources at ….)
Have you started on this yet? It might make sense to have a separate module for the renderer as well; it's quite a big part of grock, and isolating it (with clean code and better tests) would make grock itself quite simple. (You can just take my code and put it into a new repo if you like, I don't mind.)
Hi @killercup,
np, I've been really, really busy with some other work for the last two weeks, but your timing is perfect, I was just about to do some work on this.
Thanks for the compliment!
Yes, it's done and just needs to be published.
Yeah, I agree, and I've been working towards that. I built the renderer just like atom/first-mate, but in reverse. There's a directory of json/cson renderers in a format which mirrors the TextMate grammars. Right now I have a plain text and an HTML renderer, but making new formats is really easy. I'll get the code published and then you can have a look. There are still a few more modules I need to complete before we can incorporate it into Grock, though. Luckily they're all pretty trivial — a published module for each renderer with a suite of tests, and a companion module for this new renderer which performs a similar task to what TextLex does, but handles a repository of renderers instead of grammars.
Sounds great, @jonruttan! Just a quick question: how easy is it for a user to add a custom grammar module? There are a ton of those for Atom, and ideally I'd like to offer users the choice to not just use those we consider worthy, but to add their own; using e.g. …
Trivial: each one is just a serialisable JavaScript object, typically a JSON/CSON file. The first-mate module has the code for a Grammar and a GrammarRegistry class. The GrammarRegistry maintains the master list of Lexer grammars, and it's used to search the grammars for the best match for the file type being parsed. When the best match is found, the Registry instantiates a Grammar object initialised with the patterns and rules loaded from the JSON/CSON grammar file, ready to start parsing a file of that type. First-mate doesn't ship with any grammars, nor does it maintain any repositories, so one of the jobs the TextLex module performs is to deal with a set of grammars as sub-modules and register them in the GrammarRegistry. Right now it just scans an individual directory for extra grammars — the CLI version allows specifying a custom path to use — but it could easily be changed to search multiple paths, use a hand-curated list, or some combination of the two.
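The best-match lookup could be sketched as a simple extension match against each grammar's `fileTypes` list. First-mate's real GrammarRegistry also scores by scope and file contents, so this is a simplification, and the function name is hypothetical:

```javascript
// Simplified sketch of grammar selection: pick the grammar whose fileTypes
// list contains the file's extension. Grammar objects here carry only the
// two fields the lookup needs (scopeName, fileTypes).
function selectGrammar(grammars, filePath) {
  var ext = filePath.split('.').pop().toLowerCase();
  for (var i = 0; i < grammars.length; i++) {
    if ((grammars[i].fileTypes || []).indexOf(ext) !== -1) return grammars[i];
  }
  return null; // no match: the caller would fall back to plain text
}
```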
Something else that first-mate has is Grammar injections, which seem to be grammars within other grammars. If I've correctly interpreted what they're meant to do, I think we might be able to get the Lexer to parse the source code and the Markdown comments all in one pass. It seems worth finding out more about how they work, so I'll do that when I've finished these other jobs.
All of the modules for the renderer have been finished and published. Now I can start incorporating them into Grock. You asked earlier:

> How easy is it for a user to add a custom grammar module?
The jonruttan/textlex specs have an example of doing this (the code was inherited from atom/highlights) with …. And here's an example using lee-dohm/language-r:

```sh
npm install -g textlex
npm install -D lee-dohm/language-r
echo "cat('Hello, world\\\\n')" | textlex -i ../node_modules/language-r/grammars -f r
```

This outputs the tokens as JSON:

```json
[
  [
    {
      "value": "cat(",
      "scopes": [
        "source.r"
      ]
    },
    {
      "value": "'",
      "scopes": [
        "source.r",
        "string.quoted.single.r",
        "punctuation.definition.string.begin.r"
      ]
    },
    {
      "value": "Hello, world",
      "scopes": [
        "source.r",
        "string.quoted.single.r"
      ]
    },
    {
      "value": "\\n",
      "scopes": [
        "source.r",
        "string.quoted.single.r",
        "constant.character.escape.r"
      ]
    },
    {
      "value": "'",
      "scopes": [
        "source.r",
        "string.quoted.single.r",
        "punctuation.definition.string.end.r"
      ]
    },
    {
      "value": ")",
      "scopes": [
        "source.r"
      ]
    }
  ]
]
```

Piping the JSON tokens through my newest module jonruttan/delexe renders them, in this case to HTML:

```sh
npm install -g delexe
echo "cat('Hello, world\\\\n')" | textlex -i ../node_modules/language-r/grammars -f r | delexe -f html
```

Outputs:

```html
<pre class="editor editor-colors"><div class="line"><span class="source r">cat(<span class="string quoted single r"><span class="punctuation definition string begin r">'</span>Hello, world<span class="constant character escape r">\n</span><span class="punctuation definition string end r">'</span></span>)</span></div></pre>
```
Very nice, @jonruttan! I'm looking forward to reading your grock implementation! Let me know if you need any help! (In the meantime, I think I've also mastered Gitter's notification settings, so I'll react to mentions on https://gitter.im/killercup/grock :))
Atom's syntax highlighting stack has been open sourced. This post gives an overview. Mainly, highlights is of relevance to grock, as it could replace highlight.js.

Important questions to answer:

- How is the language support?
- How is the styling support?
- Does it support hooks?