-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create initial distributions for turtle, sparql and pie #5
base: master
Are you sure you want to change the base?
Conversation
Those can lead to pretty serious security issues if highlighting user provided code (because they can craft damaging payloads)... and stability issues even if you know the code isn't malicious because you could still run into sample code that triggers the backtrack that would lock up your browser, or Node.js, etc... they are the things someone really should focus on understanding and fixing - though it's not always easy... |
@@ -6,7 +6,7 @@ Description: Terse RDF Triple Language for the semantic web | |||
Website: https://www.w3.org/TR/turtle/ | |||
*/ | |||
|
|||
function(hljs) { | |||
module.exports = function(hljs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preferably if we're building with Highlight.js built tool this would all be ES6 modules (export keyword, not `module.exports), not CJS exports... The build system currently supports both, but that may not always be the case.
We'll likely still spit out "web" (hljs
global) and ESM builds into dist
for a while, I'm specifically talking about the source files in src/languages
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. I had been looking at some other language extensions and think I was mostly seeing commonJS exports, but I can make the switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well of course this has implications if someone is going to use the package directly as well, but Highlight.js is going ESM only with v12... so I think that is where everything is headed for sure. And many people already switched last year even.
@@ -48,7 +48,7 @@ function(hljs) { | |||
}; | |||
|
|||
var LANGTAG = { | |||
begin: /@[a-zA-Z]+([a-zA-Z0-9-]+)*/, | |||
begin: /@[a-zA-Z]+(-[a-zA-Z0-9]+)*/, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this fix a backtrack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fixing a backtrack though this is actually the cause of the exponential backtracking. Both versions cause it.
The spec for turtle actually has it as:
[144s] LANGTAG '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
Which I think makes sense - we should match @en-us
for example but not @en--us
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both versions cause it.
Are you sure? Can you past the exact error with help text?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the new one did fix a backtrack actually! My mistake.
function(hljs) { | ||
var ttl = hljs.getLanguage('ttl').exports; | ||
module.exports = function(hljs) { | ||
var ttl = require("./turtle")(hljs).exports; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit strange... traditionally you'd use getLanguage
, letting us resolve the language at runtime... but that does require someone load the language dependencies correctly (our build tool does this when building a web monolith). Doing it this way is going to increase the size of the builds since now sparql will include it's own copy of ttl
instead of just referring to the "already loaded" one.
Was this an intentional change and you're aware of the pros/cons and think we need to build it this way on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got you. I'm thinking although SPARQL and turtle are closely related that you wouldn't necessarily include them both at all times.
I hadn't considered the size cost of the builds - mostly I was thinking it might feel more natural to have SPARQL be able to be stand-alone, though if this could present a problem I could include some README instruction that for SPARQL to work the user must also register turtle.
I'm open to opinions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you wouldn't necessarily include them both at all times.
That's reasonable enough logic for me... we can see what the maintainers think.
Cheers for the review Josh - I'm certainly up for trying to understand it further. For now, this is the output I'm getting from the tests - the quantifiers are pretty verbose and I'm struggling to even map these back to where they appear in the turtle grammar.
|
I'll try to briefly explain in my own words, but you may need to turn to Google and read up on disjoined expressions and polynomial and exponential backtracking. It can be complex. The help that's outputted is actually pretty good... You're trying to avoid things like:
There are a whole bunch of ways this MIGHT match... (a)(aaaaaaaaaaaaaaaaaaaaaaaaa) or (aaaaaaaaaaaaaaaaaaaaaaa)(a) or anything in between... it has to try them ALL before it hits X and learns there actually is no match (even though at first it really looks like there might be)... In this case there are only 20 options though, so that's not so bad... but now take something like:
Again given:
Now, yikes... there are 2^20 different ways this could match... it could match (a)(a)(a).... (aaaaaa)(aaaaaa)(aaaaaa) (a)(aa)(aaa) and every possible combo you can imagine... and it will try them ALL, always looking for the The simplest advice I can give is to try to make sure that your regex always matches unambiguously... that each segment is "disjoint"... that the regex engine only has one way it can pattern match - and that was is always forward (never needing to backtrack)... |
It tells you:
The 5th (4+1, 0 index) mode in the turtle grammar, in the
So PNAME is the culprit. |
There are some great articles out there on this stuff... really we should find some and add them to the Wiki. |
Is any progress on this planned? I've been adapting this code for use in the new RDF specs (which use ReSpec). I've added some TriG and triple term + annotation support which I'd prefer to contribute here rather than maintain separately. (This brings #1 to mind though.) There are already other forks of this to make it work with current highlightjs (either CommonJS as here, or, which I presume is preferred, proper ES6, also migrating from the custom |
@niklasl it's been a long time since I've looked at this. At the time I was a bit lost looking at the output of the tests but Josh's explanations above were very helpful. I'd welcome contributions or using this code however you see fit; now I've been reminded this exists I'll do some thinking on it! Feels like there's two things outstanding here:
I know GitHub does highlight turtle and SPARQL. When I went digging I found that their |
I've been following progress on these for a little while - I'm interested in getting distributions for turtle and SPARQL highlighting working.
I've taken look at the advice over at the highlight.js repo and gotten this into a working state. There's an
index.html
included in this PR which demonstrates the highlighting for each of pie, turtle and SPARQL.Running the highlight.js tests shows a few failures, namely the turtle grammar causing "exponential" and "polynomial" backtracking, and some other languages getting misrecognised as pie. I haven't fixed these for now and would need some help from the maintainers if these do need to be fixed.
But seems to work well enough for now!