Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StackOverflowError from pattern input #56

Open
spacether opened this issue Aug 11, 2020 · 2 comments
Open

StackOverflowError from pattern input #56

spacether opened this issue Aug 11, 2020 · 2 comments

Comments

@spacether
Copy link

Thank you so much for making this tool!
When testing it, I ran into this case that causes a stack overflow:

String pattern = "^[a-zA-Z\s]*$";
Generex generex = new Generex(pattern);
String firstMatch = generex.getFirstMatch();

And when that code is run I get this exception:

Exception in thread "main" java.lang.StackOverflowError
	at java.base/java.util.HashMap$HashIterator.<init>(HashMap.java:1475)
	at java.base/java.util.HashMap$KeyIterator.<init>(HashMap.java:1514)
	at java.base/java.util.HashMap$KeySet.iterator(HashMap.java:912)
	at java.base/java.util.HashSet.iterator(HashSet.java:173)
	at java.base/java.util.AbstractCollection.toArray(AbstractCollection.java:184)
	at dk.brics.automaton.State.getSortedTransitionArray(Unknown Source)
	at dk.brics.automaton.State.getSortedTransitions(Unknown Source)
	at com.mifmif.common.regex.Generex.prepareTransactionNodes(Generex.java:265)

Could this be fixed?

@HawkSK
Copy link

HawkSK commented Aug 16, 2020

getFirstMatch() JavaDoc:

first string in lexicographical order that is matched by the given pattern.

Which means it tries to sort ALL of the generated matches. Your regex is infinite (it has the Kleene star), so ofc you get a SO. The solution is to use the lazy iterator:

String pattern = "^[a-zA-Z\\s]*$";
Generex generex = new Generex(pattern);
final Iterator matchesGenerator = generex.iterator();
if (matchesGenerator.hasNext()) {
	String firstMatch = matchesGenerator.next();
	System.out.println(firstMatch); // ^\t$
}

The output is however probably not what you wanted, because ^ and $ are not special characters in the used grammar (https://www.brics.dk/automaton/doc/index.html?dk/brics/automaton/RegExp.html). Omitting the anchors should make no difference, I think the regex matches the whole string anyway.
As you can see it is not identical to Java regexes, but close enough... Even though there are special characters: "@~&<#. They are marked optional in the used Automaton, however Generex uses all of them by default, sadly. In my fork I added the option to turn them off with the NONE flag - you could clone & install my devel branch if you want to try it.

@spacether
Copy link
Author

spacether commented Aug 16, 2020

Thank you for explaining what's happening with this input.
I understand that this is caused by the infinite regex.
In my opinion an infinite regex should still have a deterministic first match and should not cause a stack overflow.
For my use case, using this different package better meets my needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants