Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create initial distributions for turtle, sparql and pie #5

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rossbowen
Copy link

@rossbowen rossbowen commented Jul 28, 2022

I've been following progress on these for a little while - I'm interested in getting distributions for turtle and SPARQL highlighting working.

I've taken look at the advice over at the highlight.js repo and gotten this into a working state. There's an index.html included in this PR which demonstrates the highlighting for each of pie, turtle and SPARQL.

Running the highlight.js tests shows a few failures, namely the turtle grammar causing "exponential" and "polynomial" backtracking, and some other languages getting misrecognised as pie. I haven't fixed these for now and would need some help from the maintainers if these do need to be fixed.

But seems to work well enough for now!

@joshgoebel
Copy link
Member

joshgoebel commented Jul 28, 2022

causing "exponential" and "polynomial" backtracking

Those can lead to pretty serious security issues if highlighting user provided code (because they can craft damaging payloads)... and stability issues even if you know the code isn't malicious because you could still run into sample code that triggers the backtrack that would lock up your browser, or Node.js, etc... they are the things someone really should focus on understanding and fixing - though it's not always easy...

@@ -6,7 +6,7 @@ Description: Terse RDF Triple Language for the semantic web
Website: https://www.w3.org/TR/turtle/
*/

function(hljs) {
module.exports = function(hljs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferably if we're building with Highlight.js built tool this would all be ES6 modules (export keyword, not `module.exports), not CJS exports... The build system currently supports both, but that may not always be the case.

We'll likely still spit out "web" (hljs global) and ESM builds into dist for a while, I'm specifically talking about the source files in src/languages here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. I had been looking at some other language extensions and think I was mostly seeing commonJS exports, but I can make the switch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well of course this has implications if someone is going to use the package directly as well, but Highlight.js is going ESM only with v12... so I think that is where everything is headed for sure. And many people already switched last year even.

@@ -48,7 +48,7 @@ function(hljs) {
};

var LANGTAG = {
begin: /@[a-zA-Z]+([a-zA-Z0-9-]+)*/,
begin: /@[a-zA-Z]+(-[a-zA-Z0-9]+)*/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this fix a backtrack?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fixing a backtrack though this is actually the cause of the exponential backtracking. Both versions cause it.

The spec for turtle actually has it as:

[144s] LANGTAG '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*

Which I think makes sense - we should match @en-us for example but not @en--us.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both versions cause it.

Are you sure? Can you past the exact error with help text?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the new one did fix a backtrack actually! My mistake.

function(hljs) {
var ttl = hljs.getLanguage('ttl').exports;
module.exports = function(hljs) {
var ttl = require("./turtle")(hljs).exports;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit strange... traditionally you'd use getLanguage, letting us resolve the language at runtime... but that does require someone load the language dependencies correctly (our build tool does this when building a web monolith). Doing it this way is going to increase the size of the builds since now sparql will include it's own copy of ttl instead of just referring to the "already loaded" one.

Was this an intentional change and you're aware of the pros/cons and think we need to build it this way on purpose?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got you. I'm thinking although SPARQL and turtle are closely related that you wouldn't necessarily include them both at all times.

I hadn't considered the size cost of the builds - mostly I was thinking it might feel more natural to have SPARQL be able to be stand-alone, though if this could present a problem I could include some README instruction that for SPARQL to work the user must also register turtle.

I'm open to opinions!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you wouldn't necessarily include them both at all times.

That's reasonable enough logic for me... we can see what the maintainers think.

@rossbowen
Copy link
Author

rossbowen commented Jul 29, 2022

Cheers for the review Josh - I'm certainly up for trying to understand it further. For now, this is the output I'm getting from the tests - the quantifiers are pretty verbose and I'm struggling to even map these back to where they appear in the turtle grammar.

1) turtle
       should not cause exponential backtracking:
     Error: turtle/[1]/begin: The quantifier `([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])*` ambiguous for all words "\\-".repeat(n) for any n>1. This will cause exponential backtracking.

To fix this issue, you have to rewrite the element (let's call it E) of the quantifier. The goal is modify E such that it is disjoint with repetitions of itself. This means that if a (sub-)string is matched by E, then it must not be possible for E{2}, E{3}, E{4}, etc. to match that (sub-)string.

Example: `(?:\w+|::)+`
The problem lies in `\w+` because `\w+` and `(?:\w+){2}` are not disjoint as the string "aa" is fully matched by both. In this example, the pattern by easily fixed by changing `\w+` to `\w`.
In the real world, patterns can be a lot harder to fix. If you are trying to make the tests pass for a pull request but can't fix the issue yourself, then make the pull request (or commit) anyway. A maintainer will help you.

Full pattern:
/\^\^([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F]([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.]*[-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀])?)?:([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_:0-9]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])*([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])?/im

turtle/[4]/begin: The quantifier `([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])*` ambiguous for all words "\\-".repeat(n) for any n>1. This will cause exponential backtracking.

To fix this issue, you have to rewrite the element (let's call it E) of the quantifier. The goal is modify E such that it is disjoint with repetitions of itself. This means that if a (sub-)string is matched by E, then it must not be possible for E{2}, E{3}, E{4}, etc. to match that (sub-)string.

Example: `(?:\w+|::)+`
The problem lies in `\w+` because `\w+` and `(?:\w+){2}` are not disjoint as the string "aa" is fully matched by both. In this example, the pattern by easily fixed by changing `\w+` to `\w`.
In the real world, patterns can be a lot harder to fix. If you are trying to make the tests pass for a pull request but can't fix the issue yourself, then make the pull request (or commit) anyway. A maintainer will help you.

Full pattern:
/([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F]([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.]*[-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀])?)?:(([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_:0-9]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])*([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])?)?/im

2) turtle
       should not cause polynomial backtracking:
     Error: turtle/[1]/begin: Polynomial backtracking. By repeating any character that matches /:/i, an attack string can be created.

    [-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.]*[-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀])?)?:([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_:0-9]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])*
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                                                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Full pattern:
/\^\^([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F]([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.]*[-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀])?)?:([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_:0-9]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])*([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])?/im
                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                                                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

turtle/[4]/begin: Polynomial backtracking. By repeating any character that matches /:/i, an attack string can be created.

    [-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.]*[-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀])?)?:(([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_:0-9]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])*
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                                                                                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Full pattern:
/([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F]([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.]*[-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀])?)?:(([A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_:0-9]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀.:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])*([-A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ-῿‌-‍⁰-↏Ⰰ-⿯、-퟿豈-﷏ﷰ-�က0-F_0-9·̀-ͯ‿-⁀:]|%[0-9A-Fa-f][0-9A-Fa-f]|\\[_~.!$&'()*+,;=/?#@%-])?)?/im
                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                                                                                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@joshgoebel
Copy link
Member

I'll try to briefly explain in my own words, but you may need to turn to Google and read up on disjoined expressions and polynomial and exponential backtracking. It can be complex. The help that's outputted is actually pretty good...

You're trying to avoid things like:

(a+)(a+)F

a (one or many times) followed by a one or many times, then f... Regex isn't as smart as you think... it's walking the string really fast... it's not matching it like your brain might... so if you have something like:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX

There are a whole bunch of ways this MIGHT match... (a)(aaaaaaaaaaaaaaaaaaaaaaaaa) or (aaaaaaaaaaaaaaaaaaaaaaa)(a) or anything in between... it has to try them ALL before it hits X and learns there actually is no match (even though at first it really looks like there might be)...

In this case there are only 20 options though, so that's not so bad... but now take something like:

(a+)*F

Again given:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX

Now, yikes... there are 2^20 different ways this could match... it could match (a)(a)(a).... (aaaaaa)(aaaaaa)(aaaaaa) (a)(aa)(aaa) and every possible combo you can imagine... and it will try them ALL, always looking for the F that it will never find. If the string was instead aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaF it would match crazy fast, cause whatever it tried first would match... this is why these regex are so dangerous... because often you're writing them to match... and not thinking about what happens if they FAIL to match - or just run across the wrong text content purely by accident.


The simplest advice I can give is to try to make sure that your regex always matches unambiguously... that each segment is "disjoint"... that the regex engine only has one way it can pattern match - and that was is always forward (never needing to backtrack)...

@joshgoebel
Copy link
Member

I'm struggling to even map these back to where they appear in the turtle grammar.

It tells you:

turtle/[4]/begin:

The 5th (4+1, 0 index) mode in the turtle grammar, in the begin regex...

    contains: [
      LANGTAG, // 0
      DATATYPE, // 1
      IRI_LITERAL, // 2
      BLANK_NODE, // 3
      PNAME, // 4

So PNAME is the culprit.

@joshgoebel
Copy link
Member

There are some great articles out there on this stuff... really we should find some and add them to the Wiki.

@niklasl
Copy link
Contributor

niklasl commented Feb 23, 2025

Is any progress on this planned? I've been adapting this code for use in the new RDF specs (which use ReSpec). I've added some TriG and triple term + annotation support which I'd prefer to contribute here rather than maintain separately. (This brings #1 to mind though.)

There are already other forks of this to make it work with current highlightjs (either CommonJS as here, or, which I presume is preferred, proper ES6, also migrating from the custom exports in favor of that). Is there anything I can do to help?

@rossbowen
Copy link
Author

@niklasl it's been a long time since I've looked at this. At the time I was a bit lost looking at the output of the tests but Josh's explanations above were very helpful. I'd welcome contributions or using this code however you see fit; now I've been reminded this exists I'll do some thinking on it!

Feels like there's two things outstanding here:

  • Swapping to ES6.
  • Fixing the exponential and polynomial backtracking warnings from the tests.

I know GitHub does highlight turtle and SPARQL. When I went digging I found that their github-linguist repo was set up to use this as the basis of the highlighting. I wondered whether that might be a source of inspiration on how to handle the backtracking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants