-
-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better unsafe regex detector #28
Comments
@evilpacket I'm happy to submit a PR with a demonstration if desired. |
@evilpacket Sounds good? |
@davisjam You've updated safe-regex since submitting this and I plan to bring in that change. For something that hits an external source I'd like it to be configurable or a different rule and clear as to what is sent to the remote server etc. A lot of people get nervous when code they are scanning starts to get sent to a remote destination, but having this capability would be nice. Would the server side be something that others could also run themselves? |
@evilpacket the gist of it is probably all at https://github.com/davisjam/vuln-regex-detector/ |
Yeah I think best bet here is to not be sending anything to a remote server, and instead implement the caching stuff locally.... |
The server-side code is available, but would have to be run on the user's side. Unfortunately the existing advanced analyses are written in Java and OCaml, not JS. I have half a million labeled regexes, so I suppose another option is to ship this database in |
@davisjam is there a technical reason preventing the tools from being written in JS, or has that work just not been done by anyone? |
The analysis could be written in JavaScript, and I would happily
incorporate it into safe-regex. I can point anyone interested in this in
the right direction.
…On Wed, Sep 11, 2019, 9:51 PM Matthew Herbst ***@***.***> wrote:
@davisjam <https://github.com/davisjam> is there a technical reason
preventing the tools from being written in JS, or has that work just not
been done by anyone?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFOD3L2EY7V6EJKQCHBZAO3QJGOCNANCNFSM4EYDDY7A>
.
|
@davisjam I will introduce my research MakeNowJust-Labo/redos. This is a hybrid ReDoS detection library. "Hybrid" means, it combines automaton based detector and fuzzing based detector. When the RegExp pattern is hard to be checked by the automaton based, it switches to use the fuzzing based. This allows checking many patterns in the real world. This library is written in Scala, but it is compiled to JavaScript by Scala.js. So, you can use this from JavaScript easily. In addition, I published eslint-plugin-redos for easy use. Feel free to try it! |
@makenowjust Thanks for letting me know about it! Looks cool. I looked over the documentation (not the implementation) but I'm not sure exactly what you've made. Are you building on the academic research on the topic (e.g. Shen for fuzzing, Weideman/Wustholz/Rathnayake for NFA analysis)? What guarantees does the library offer? |
@davisjam I'm not sure the below may be an answer to you. Sorry that my English is poor.
More discussions are needed to conclude "what they detect are same.", but I believe both of them can find a vulnerability. And, each of them has pros and cons. For example, automaton based detector can find vulnerability precisely, however it is very slow on several patterns, and hard to support some syntax. On the other hand, fuzzing based detector can support all syntax, and we can control analyzing time easily, however it is hard to avoid false-negative. So, I think it is better to combine both. Fuzzing based detector refers Shen's ReScue, but my implementation is very extended. In fact, ReScue cannot find a polynomial order case, but my implementation can in many cases. I'm preparing a paper about this now. And, I'm going to talk about this at the 62nd programming symposium in Japan (I'll talk in Japanese, sorry). I'm planning to explain detailed ideas (detector selection algorithm, and fuzzing improvements) here. |
@makenowjust Happy to discuss more over email -- I sent you a note a few days ago. |
@nzakas would it make sense to replace https://www.npmjs.com/package/safe-regex with https://github.com/makenowjust-labs/recheck? ReCheck seems better maintained, but I am in no position to gauge if it is better or not nor if it is a smart choice to switch. |
(I own safe-regex. It's very simple, not much to maintain).
ReCheck is employing a more sophisticated algorithm that, based on the docs, should have more true positives and fewer false positives compared to safe-regex.
ReCheck will be more time consuming to run.
One potential check is how ReCheck would perform on a regex heavy codebase (say, the marked project for markdown). Based on my experience with similar tools, a local cache might be helpful for on-workstation use. I'm not sure if ReCheck has one.
Get Outlook for iOS<https://aka.ms/o0ukef>
…________________________________
From: Tobias Hernstig ***@***.***>
Sent: Monday, January 30, 2023 4:11:47 AM
To: eslint-community/eslint-plugin-security ***@***.***>
Cc: Davis, James C ***@***.***>; Mention ***@***.***>
Subject: Re: [eslint-community/eslint-plugin-security] Better unsafe regex detector (#28)
---- External Email: Use caution with attachments, links, or sharing data ----
@nzakas<https://github.com/nzakas> would it make sense to replace https://www.npmjs.com/package/safe-regex with https://github.com/makenowjust-labs/recheck? ReCheck seems better maintained, but I am in no position to gauge if it is better or not nor if it is a smart choice to switch.
—
Reply to this email directly, view it on GitHub<#28 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AFOD3L5BIMOYH2KTYM7KN7TWU6AVFANCNFSM4EYDDY7A>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
@davisjam your original post indicated that https://github.com/davisjam/vuln-regex-detector would be a better choice than https://github.com/davisjam/safe-regex, that you are both maintainer of. Your last post in this thread indicated you reached out to @makenowjust but that seem to never have concluded something in this thread at least. The question is then, are both projects worthy replacements and the option should be on the user, or what would be the best fit for eslint-plugin-security? I think most code bases uses very few regexes that the extra time needed should be low enough so that the one with the most true positives and fewest false positives would be the best one to use, but that is my subjective opinion. We benchmarked safe-regex vs recheck and we are only talking milliseconds for some of the more complex regexes in our tests, but they were not exhaustive. |
The goal of the plugin is to try to catch as many problems as possible with as few false positives and false negatives as possible. Anything that moves us further in that direction is worthwhile to investigate. That said, I'm not a security researcher, so I'd lean on the recommendations of folks like @davisjam to decide how to move forward here. |
@nzakas neither are we. We tested all of them and found the one by @makenowjust to have the least false positives and most true positives, but granted we just tested a few regexes. So that is very anecdotal and should not hold any ground for choosing one. |
If you could share your findings in this thread, maybe we can get some folks to review them and verify. |
@nzakas here are our simple results:
Using this script (not the prettiest): import { performance } from 'node:perf_hooks';
import { check, checkSync } from 'recheck';
import safeRegex from 'safe-regex';
// See https://github.com/davisjam/vuln-regex-detector/issues/81
// 'vuln-regex-detector' is not tested as it does not work properly
const regexes = [
{ regex: /^test/, vulnerable: false },
{ regex: /^((a)+)+z$/, vulnerable: true },
{ regex: /^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$/, vulnerable: false },
{ regex: /(\w+)\_(\w+)$/, vulnerable: true },
{ regex: /^[0-9a-f]{4}$/, vulnerable: false },
{ regex: /^[0-9a-f]{4,30}$/, vulnerable: false },
{ regex: /^(?<foo>(?:Foo|Bar))$/, vulnerable: false },
{
regex: /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/,
vulnerable: false,
},
];
const resultPrinter = (result) => (result ? 'safe' : 'vulnerable');
// Call each function once in case they have code just running once on first call
safeRegex('foo');
await check('foo', '');
for (const regex of regexes) {
const pattern = regex.regex.source; // String (RegExp.prototype.source)
let name; // npm package name
let result;
// recheck async
name = 'recheck-async';
performance.mark(`start-${name}`);
result = await check(pattern, '');
performance.mark(`end-${name}`);
performance.measure(`${name} ${pattern}`, {
detail: { name, pattern, result: result.status },
start: `start-${name}`,
end: `end-${name}`,
});
// recheck sync
name = 'recheck-sync';
performance.mark(`start-${name}`);
result = checkSync(pattern, '');
performance.mark(`end-${name}`);
performance.measure(`${name} ${pattern}`, {
detail: { name, pattern, result: result.status },
start: `start-${name}`,
end: `end-${name}`,
});
// safe-regex
name = 'safe-regex';
performance.mark(`start-${name}`);
result = safeRegex(pattern);
performance.mark(`end-${name}`);
performance.measure(`${name} ${pattern}`, {
detail: { name, pattern, result: resultPrinter(result) },
start: `start-${name}`,
end: `end-${name}`,
});
}
const map = new Map();
performance.getEntriesByType('measure').forEach((measure) => {
if (map.has(measure.detail.pattern)) {
map.get(measure.detail.pattern).push(measure);
} else {
map.set(measure.detail.pattern, [measure]);
}
});
map.forEach((measures, pattern) => {
console.log(`regex: /${pattern}/`);
console.log(
` expected result: ${
regexes.filter((regex) => regex.regex.source === pattern)[0].vulnerable
? 'vulnerable'
: 'safe'
}`
);
measures.forEach((measure) => {
console.log(
` ${measure.detail.name}: ${
measure.detail.result
} (${measure.duration.toFixed(3)} ms)`
);
});
console.log();
}); |
Thanks. For clarity, can you also add a column for expected result? (Useful for places where different checkers return different results.) |
Updated. @davisjam and @makenowjust can you double-confirm the above results? |
We've seen false positives with the unsafe regex rule and named capture groups. Would it be possible to add a row to that table for regular expressions with named capture group(s)? E.g. |
@jgeurts I tried to update the table a few days ago, but ran into bugs with both of the packages when trying again: See: I expect @makenowjust to fix the recheck problem, but I am not sure about vuln-regex-detector as the repository has very low activity. Once I get some fixes, I plan to add a script in this issue that can be used to compare them. I will then add your regexes @jgeurts. |
Thanks @thernstig! We will wait to move forward for that update. |
Looks like @makenowjust has been able to resolve the issue in recheck 🥳 . I do not think I have time to resolve the issue in vuln-regex-detector. FWIW The table of results ("expected result") in #28 (comment) looks correct to me. |
@nzakas @makenowjust @jgeurts I have updated the table. Scrutinize the code used to test if you have time. |
Thanks so much! It does look like recheck is more accurate, but to get a true apples-to-apples comparison, I think we need to run |
|
@nzakas @makenowjust updated the table with So changing will suffer quite a bit of performance, but be more correct. @nzakas I assume it is up to you to decide now what to do with this information. Let me know if there are other packages and/or regexes you want added. |
If asynchronous processing is a bottleneck for us, I think we might be able to use synckit. |
I assume that ESLint intends to allow for asynchronous processing sooner or later? Maybe better focus on fixing that? |
Async rules in ESLint are still in the future, so we can't wait for that. Unfortunately, I don't think we can use recheck right now considering how much slower it is. If it were just a little slower, that would be one thing, but in some cases it's an order of magnitude slower, and I don't think the tradeoff is worth it. |
Let's see if @makenowjust can solve this as he is implementing synckit. If that improves the numbers using |
If that can bring the numbers closer to safe-regex, then definitely. Although it might be worth the effort to see if we can improve safe-regex to be more accurate. |
Looks like there is now also https://github.com/tjenkinson/eslint-plugin-redos-detector, using https://github.com/tjenkinson/redos-detector in case anyone is curious (or looking for alternatives). I haven't measured how redos-detector compares to recheck though. |
Hi all,
I'm a systems/security researcher at Virginia Tech and have been studying the incidence of vulnerable regexes in the wild.
This plugin's unsafe regex detector relies on safe-regex, which uses star height (nested quantifiers) to identify unsafe regexes.
Pros:
As a result, safe-regex is great for CI use cases.
Cons:
(ab+)+
).There are some alternatives to safe-regex that report exploit strings so you can tell if they're correct or not.
Unfortunately:
My project vuln-regex-detector provides a convenient wrapper for these alternatives, and enforces time and memory limits to get results or fail relatively quickly.
However, I'd be surprised if developers were willing to wait even 30 seconds for linting. To address that, I'm nearly done implementing a server side so queries can be answered by hitting the server for a pre-computed answer instead of doing the expensive computation locally. The server processes not-seen-before queries in the background so subsequent queries will get a real answer.
Once that's done, would you folks be interested in hitting my server first and falling back to
safe-regex
if my server hasn't seen the query before? I've got a sample client that can be used with a one-line tweak for this use case.The text was updated successfully, but these errors were encountered: