Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for #12 -- Allowed memory size of 134217728 bytes exhausted. #34

Open
wants to merge 18 commits into
base: dev
Choose a base branch
from

Conversation

kaby76
Copy link

@kaby76 kaby76 commented Dec 21, 2022

Summary

This is a fix for Issue #12. The main problem was due to some confusion over Map, Set, ===, hash(), and equals(), which is easy to get wrong because these data structures are pretty complicated. (In fact, I think I may have the equivalent() method wrong in the anonymous class in ATNConfigSet.)

Checklist

I've tested this with grammars-v4/aql. The parse completes for for.aql now, and the parse tree is the same between PHP and CSharp.

I will be checking the parser ATN trace for all grammars in grammars-v4 and examples.

@kaby76
Copy link
Author

kaby76 commented Dec 23, 2022

There doesn't seem to be any assert() calls in the PHP runtime--at all. For example, here in C# tests for some sanity in the stream, but it is missing here in PHP. How can I guarantee code safe?

@marcospassos
Copy link
Collaborator

PHP does have support for assertions:
https://www.php.net/manual/en/function.assert.php

But I'd recommend using exceptions for that case.

@kaby76
Copy link
Author

kaby76 commented Dec 24, 2022

I'm trying to fix the ATNConfigSet configLookup data structure, and realizing that what I thought was a Map isn't a Map like Java and C#. Consequently, when I tried to update equivalent() to be something akind to pointer equivalence, nothing worked. So, I read the source code in the Antlr PHP runtime, OpenJDK HashMap, and Dotnet Dictionary.

Map uses an Equivalence, but this--strangly--inherits from Equatable.

interface Equivalence extends Equatable
{
    public function equivalent(Hashable $left, Hashable $right): bool;
    public function hash(Hashable $value): int;
}

interface Equatable
{
    public function equals(object $other): bool;
}

Now, HashMap<> in Java uses equals(); Dictionary<> in the Dotnet runtime also uses Equals(). There is no such thing as the "Equivalence" class in C# or Java. There is no equivalent() method to cloud which comparison method to use/define, and determine what is the difference between the two.

In the Antlr PHP runtime, when entering a value into a Map, it uses equivalent(), not equals() That's explains why my changes to equivalent() for phpstan broke my fix.

Can someone tell me what in the world equals() is supposed to do in relation to equivalent()? Are they different or the same? Map doesn't even use equals().

@marcospassos
Copy link
Collaborator

PHP does not have native support for logical value comparison. So, hash-based data structures need to know how to compare the value for equality. That's where Equivalence comes in.

For example, to reproduce Java’s behavior, one needs to implement an Equivalence that just delegates the check to the equals method. But it also allows to have custom logic in case the equals doesn't meet the requirement.

@kaby76
Copy link
Author

kaby76 commented Dec 24, 2022

PHP does not have native support for logical value comparison. So, hash-based data structures need to know how to compare the value for equality. That's where Equivalence comes in.

For example, to reproduce Java’s behavior, one needs to implement an Equivalence that just delegates the check to the equals method. But it also allows to have custom logic in case the equals doesn't meet the requirement.

Thanks, got it.

So, I know what Equivalence.equivalent() is supposed to do, and I also know what Equivalence.hash() is supposed to do. But what is Equivalence.equals() supposed to do?

In fact, as a check, I just removed the inheritance in Equivalence to Equatable, removed the implementations for Equivalence.equals() and it all works fine. equals() is never called in a Map or Set. And that's the only places where Equivalence are used.

Actually, Equivalence.equals() is referenced in Set here

public function equals(object $other): bool
. But, it can be removed, and it still all works fine.

I must say, phpstan and phpcs are really, really essential. Great tools..

@marcospassos
Copy link
Collaborator

marcospassos commented Dec 24, 2022

Equivalence is an object as any other object. The equal method just tells whether an equivalence relation is equal to another. In other words, a relation R is said to be equal to another relation W if, and only if, they are the same class and impose the same equivalence relation. In practice, for the implementation shipped in the target, it means being in the same class (since they are final classes).

@marcospassos
Copy link
Collaborator

marcospassos commented Dec 24, 2022

I must say, phpstan and phpcs are really, really essential. Great tools

PHPStan is one of the most advanced analysis tool for PHP. It provides static-time support for advanced types like generics, conditional types (like TS), and some other capabilities that even the Java compiler doesn't support. Reason why I maintain the PHP extension for Antlr.

@marcospassos
Copy link
Collaborator

In some places, we're using anonymous classes for implementing custom equivalences. That may be related to the problem. If so, it's just a matter of extracting the class into named classes.

The point is that two sets are only equivalent if they have the same elements and the same equivalence relation. That's why the equals method in Equivalence is essential.

@kaby76
Copy link
Author

kaby76 commented Dec 26, 2022

I haven't yet found the bug with the circular pointer chain issue. But, I'm making my way through the g4 grammars. It seems that for many cases, the code is 20% faster, sometimes more. That said, there is some odd behavior in that some inputs for a grammar are very fast, e.g., ~0.3s, and then other inputs for the same grammar are really slow, like 5-100s depending on the grammar. I think it has to do with handling ambiguity.

For example, for agc input ASSEMBLY_AND_OPERATION_INFORMATION.agc, the parse is a reasonable 0.4s. But, for PINBALL_GAME_BUTTONS_AND_LIGHTS.agc, it takes a staggering 88s (or 120s for current "dev" branch). The lab.antlr.org tester says "line_instruction" alt 6 is very ambiguous, with "max k"=12, but executed ~2000 times.

@kaby76
Copy link
Author

kaby76 commented Jan 1, 2023

The code is still having some ATN trace differences: css3, bootstrap-theme.min.css. Need to fix these first before continuing with performance.

@kaby76
Copy link
Author

kaby76 commented Jan 2, 2023

I still do not know why, but I know where there's a divergence between C# and PHP. It is in the stack merging of sub-parsers (see Section 5.1 of https://www.antlr.org/papers/allstar-techreport.pdf). The parent chains between the two targets look the same, but are not. In other words, the merge caches differ between targets. The merge cache is implemented in CSharp as MergeCache, which contains a field Dictionary<PredictionContext, Dictionary<PredictionContext, PredictionContext>>. In PHP, it is implemented as DoubleKeyMap. This code is placed strangely in src/Utils/ as though it is a general data structure, but only stores PredictionContext, and only used for subparser merging (in C#, it is in the atn/ code). I am not confident whether the C# or PHP code is working correctly yet as this code is pretty complicated.

@marcospassos
Copy link
Collaborator

Maybe compare to a third, battle-tested target? I always take the Java target as the ground truth.

@kaby76
Copy link
Author

kaby76 commented Jan 2, 2023

Maybe compare to a third, battle-tested target? I always take the Java target as the ground truth.

Java and CSharp ATN traces are in complete agreement (~446 MB output size).

@kaby76
Copy link
Author

kaby76 commented Jan 3, 2023

I decided to just pull out the -trace option on the parsing test and just compare the parse trees between CSharp, Java, and PHP. Yes, it looks like PHP isn't working correctly sometimes. Java and CSharp are in agreement.

~/issues/issue-3462/kaby76/css3/Generated-PHP ~/issues/issue-3462/kaby76/css3
PHP 0 ../examples/mdn-at-counter-style.css success 14.5657256
Total Time: 14.6478964
dos2unix: converting file o.txt to Unix format...
~/issues/issue-3462/kaby76/css3
~/issues/issue-3462/kaby76/css3/Generated-CSharp ~/issues/issue-3462/kaby76/css3
CSharp 0 ../examples/mdn-at-counter-style.css success 0.1630958
Total Time: 0.2249674
dos2unix: converting file o.txt to Unix format...
~/issues/issue-3462/kaby76/css3
1c1
< (stylesheet ws (nestedStatement (counterStyle @counter-style (ws  ) (ident circled-alpha) (ws  ) { (ws \n  ) (declarationList (declaration (property_ (ident system) ws) : (ws  ) (expr (term (ident fixed) ws))) ws ; (ws \n  ) (declaration (property_ (ident symbols) ws) : (ws  ) (expr (term (ident Ⓐ) (ws  )) (term (ident Ⓑ) (ws  )) (term (ident Ⓒ) (ws  )) (term (ident Ⓓ) (ws  )) (term (ident Ⓔ) (ws  )) (term (ident Ⓕ) (ws  )) (term (ident Ⓖ) (ws  )) (term (ident Ⓗ) (ws  )) (term (ident Ⓘ) (ws  )) (term (ident Ⓙ) (ws  )) (term (ident Ⓚ) (ws  )) (term (ident Ⓛ) (ws  )) (term (ident Ⓜ) (ws  )) (term (ident Ⓝ) (ws  )) (term (ident Ⓞ) (ws  )) (term (ident Ⓟ) (ws  )) (term (ident Ⓠ) (ws  )) (term (ident Ⓡ) (ws  )) (term (ident Ⓢ) (ws  )) (term (ident Ⓣ) (ws  )) (term (ident Ⓤ) (ws  )) (term (ident Ⓥ) (ws  )) (term (ident Ⓦ) (ws  )) (term (ident Ⓧ) (ws  )) (term (ident Ⓨ) (ws  )) (term (ident Ⓩ) ws))) ; (ws \n  ) (declaration (property_ (ident suffix) ws) : (ws  ) (expr (term " " ws))) ; (ws \n)) } (ws \n))) <EOF>)
---
> (stylesheet ws (nestedStatement (counterStyle @counter-style (ws  ) (ident circled-alpha) (ws  ) { (ws \n  ) (declarationList (declaration (property_ (ident system) ws) : (ws  ) (expr (term (ident fixed) ws))) ws ; (ws \n  ) (declaration (property_ (ident symbols) ws) : (ws  ) (expr (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) (ws  )) (term (ident ?) ws))) ; (ws \n  ) (declaration (property_ (ident suffix) ws) : (ws  ) (expr (term " " ws))) ; (ws \n)) } (ws \n))) <EOF>)

@marcospassos
Copy link
Collaborator

marcospassos commented Mar 6, 2023

@kaby76, I just didn't invest more time here to refine the implementation because you said there are still some differences.

If you could fix these differences, I'll gladly review, test and finish everything else. I want to make sure there are no differences left before merging this.

@kaby76
Copy link
Author

kaby76 commented Mar 6, 2023

That's fine. I will come back to this after going through the tree diffing implementation for grammars-v4. We can then worry about improving performance.

@marcospassos
Copy link
Collaborator

Great! Do you have any ETA? I'm asking because I want to reserve a time slot on my side to work on this

@kaby76
Copy link
Author

kaby76 commented Mar 6, 2023

I'm probably going to revisit this next week. The changes for antlr/grammars-v4#3138 are nearly done, checking in today and tomorrow. The changes for antlr/grammars-v4#2991 is really easy: just remaster all the .tree files and turn off any ports that aren't producing a correct tree.

@kaby76
Copy link
Author

kaby76 commented Mar 6, 2023

@marcospassos Is it possible to get a release for the current 4.12.0 for the PHP runtime? That would in setting up the testing for PHP for grammars-v4 so I don't need to special case the PHP tests to use the older 4.11.x tool.

@marcospassos
Copy link
Collaborator

Sure!

@marcospassos
Copy link
Collaborator

@kaby76 done

@kaby76
Copy link
Author

kaby76 commented Mar 6, 2023

Thanks!! It works!

@marcospassos
Copy link
Collaborator

marcospassos commented Apr 10, 2023

@kaby76, could you please share the instructions for getting the remaining diffs? You probably didn't have time to return to this, so I'll try to take over. I just need to know what is still wrong and how to reproduce.

Also, did you try against the Java implementation that's the reference? We can't just compare PHP and C# and state that PHP is wrong. The Java target is the ground truth.

@kaby76
Copy link
Author

kaby76 commented Apr 10, 2023

Sorry, I haven't had much time to recover where I was because there are so many other issues going on in grammars-v4 (stuck on the scala grammar).

The diff was in the trace for css3 as mentioned here: #34 (comment)

grammar css3
input bootstrap-theme.min.css

To test, parse trace must be on (PHP $parser->setTrace(true);, php -d memory_limit=1G Test.php; Java parser.setTrace(true);). I think I also tested using the internal atn tracing as well to check the set computations. I don't have diffs because in general the files would are gigabytes long.

To turn on atn tracing: PHP Antlr\Antlr4\Runtime\Atn\ParserATNSimulator::$traceAtnSimulation = true;; Java ParserATNSimulator.trace_atn_sim = true;). CSharp and Java always agreed, all gigabyte lengthed files. After identifying computation divergence, I used side-by-side stepping to find data structure computation errors in PHP. I used CSharp because the CSharp tools are far, far superior in almost every way over multiple Java debugging tools. For example, there is no set program counter and go in Java.

I expect to find more tree diff errors across targets for grammars-v4, but I haven't tried turning it on yet.

@marcospassos
Copy link
Collaborator

marcospassos commented Apr 10, 2023

I just came up with an idea on how to help find the investigation starting point:

  • Generate the tracing diff in the target used as the ground truth (i.e., CSS3 for Java).
  • Use a special logger implementation that, instead of writing, checks if the current line matches the entry being written
  • If it doesn't, throw an exception. This is precisely the diverging point.

I'll test this approach. Provided it works, we can test it against any grammar and quickly identify under which conditions it happens.

What do you think, @kaby76?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants