Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage collection happens too late resulting in huge memory consumption #5023

Open
asl opened this issue Nov 21, 2024 · 4 comments
Open

Garbage collection happens too late resulting in huge memory consumption #5023

asl opened this issue Nov 21, 2024 · 4 comments
Labels
compiler-performance Topics on improving the performance of the compiler core. core Topics concerning the core segments of the compiler (frontend, midend, parser)

Comments

@asl
Copy link
Contributor

asl commented Nov 21, 2024

Recently I was profiling the memory consumption of some large apps and it turns out that the memory allocation / deallocation pattern is very concerning.

I collected the GC used memory statistics as reported via -T pass_manager:3, so essentially each data collection also triggers a full GC collection cycle.

In the graph below is the used memory for the frontend only. The x-axis is the pass index and the y-axis is memory allocation in megabytes.

image

What we can see here:

  • Overall almost 900 passes are executed in the frontend (this includes nested pass managers, but does not include on-fly visitor invocations)
  • The memory consumption steadily grows until the end of the frontend reaching ~1.5 Gb with some spikes up to 5 Gb.
  • The area between blue lines is the SimplifyDefUse, so all the spikes belong to defus being memory hog. But let's put this aside in this issue
  • The upward trend right from the last blue line is inliner and passes around inliner
  • There is huge drop of the used memory after frontend is finished: the last point is the first pass of midend.

Indeed, the situation looks like this:

heap after HierarchicalNames: in use 1.5GB, max 1.6GB
FrontEnd invoking FrontEndLast
heap after FrontEndLast: in use 1.5GB, max 1.6GB
MidEnd invoking ConvertEnums
heap after ConvertEnums: in use 589MB, max 1.7GB

Note that Frontend (as pass manager) was responsible for almost 2/3 of the memory that was not collected.

And indeed, the pass manager is pretty hostile to GC:

  • PassManager owns passes and holds pointers to them
  • Many passes have internal state (e.g. maps or just pointer to node objects) that are still reachable after pass execution
  • As a result we're ending with lots of GC roots that could only be collected after top-level pass-manager would be deleted.

I sprinkled some finalizers on P4Program and it looks like bunch of those are only collected together with Frontend. And this is suspicious as these are almost never held in the internal state of the passes besides various ProgramMap's which is mostly TypeMap as of now.

The particular "leak" pattern looks like this (this is with additional tracing for clones and collection added):

    DefaultArguments invoking DoDefaultArguments
Created <P4Program>(12469767)
    heap after DoDefaultArguments: in use 325MB, max 389MB
    DefaultArguments invoking P4::ClearTypeMap
Program has changed from <P4Program>(6781157) to <P4Program>(12469767)
    heap after P4::ClearTypeMap: in use 266MB, max 389MB
    DefaultArguments invoking P4::TypeInference
Created <P4Program>(13001472)
Program has changed from <P4Program>(512720) (empty) to <P4Program>(12469767)
TypeMap updated to <P4Program>(13001472)
    heap after P4::TypeInference: in use 331MB, max 406MB
  heap after DefaultArguments: in use 331MB, max 406MB
  P4::PassRepeated invoking P4::(anonymous namespace)::SetStrictStruct
  heap after P4::(anonymous namespace)::SetStrictStruct: in use 331MB, max 406MB
  P4::PassRepeated invoking P4::TypeInference

So, what happens here:

  • DoDefaultArguments did something as a result new <P4Program>(12469767) was cloned and clone was not collected as being unused.
  • Then TypeInference also inserted some real casts and therefore the P4Program was cloned again into <P4Program>(13001472)

The problem here is that <P4Program>(12469767) is only collected when Frontend dies, so it's pointer is being stored somewhere. However, I was unable to track this down so far:

  • TypeMap holds pointer to P4Program, however, it was already updated to a new one.
  • TypeInfenceBase indeed holds the pointer of top-level node (initialNode), but clearing it in end_apply did not change anything.
@asl asl added core Topics concerning the core segments of the compiler (frontend, midend, parser) compiler-performance Topics on improving the performance of the compiler core. labels Nov 21, 2024
@asl
Copy link
Contributor Author

asl commented Nov 21, 2024

Update: the last mistery with P4Program being held is solved. It is in ResolutionContext cache that needs to be cleared. Now we at least need to sprinkle bunch of end_apply's

@asl
Copy link
Contributor Author

asl commented Nov 25, 2024

Tagging @vlstill @fruffy @ChrisDodd for insights.

The case outlined above is pretty mild. I saw other cases of erratic behavior:

  • Memory spikes up to 32 Gb due to GC happens too late to clean up temporary objects created by use-def
  • Memory consumption being close to some GC collection threshold. As a result, GC is triggered after every ComputeWriteSet call (!) resulting in +100 seconds of the compile time

@vlstill
Copy link
Contributor

vlstill commented Nov 25, 2024

I suspect one problem would be a common pattern in visitors that collect information during the run and clear internal state in init_apply. Therefore, anything reachable through the internal state is kept alive until the pass is applied again (which it is often not, and if it is new data is kept this way). The better approach is of course to clear anything not publicly visible (i.e. information collected for other passes) in end_apply. The slight disadvantage is that you then end up with two inits for some fields (i.e. zeoring an integer fields needs to happen both in a pass constructor and in end_apply).

One alternative could be to allow a pass manager to (instead of passes) hold pass factory functions that preserve the construction parameters and construct passes only before they are run and destruct them after the run. This would also strongly discourage cross-talk between repeated runs of a pass.

Another problem could be in local pass managers (which are called e.g. from frontend) which heap-allocate a state that is shared across the passes (or access info from earlier passes by later passes in the manager). Again, such state can easily become a new root released only when the (nested) pass manager is released. Here, clearing the memory requires even more work than just an end_apply as it needs to happen at the end of the pass manager (or earlier after the data is no longer needed).

I am not sure how wide-spread these patterns are in frontend. It might be worth investigating starting from passes which are known to retain a lot of memory.

Sadly, I'd say that by applying GC to everything (as opposed to e.g. apply it only to IR nodes), the original P4C authors created an illusion that memory is not a problem. This discourages thinking about memory lifetimes, which eventually becomes a problem :-(.

@asl
Copy link
Contributor Author

asl commented Nov 25, 2024

I am not sure how wide-spread these patterns are in frontend. It might be worth investigating starting from passes which are known to retain a lot of memory.

I spent quite some time on this, and there are lots of cases like this. Typical scenario is that some information is collected in several inspectors, then passed down to transform, etc. Certainly, the internal state is almost never cleared, many passes implicitly assumes that they are only run once. Those who have to deal with the state cleanup is ones running in PassRepeated context. But still, they are usually clearing state in init_apply instead of end_apply, so we left with the values collected last.

One alternative could be to allow a pass manager to (instead of passes) hold pass factory functions that preserve the construction parameters and construct passes only before they are run and destruct them after the run.

The cheap alternative is for existing PassManager to destroy passes after they are applied. This essentially implies that passes got executed once (certainly, PassRepeated would need to kill everything in the end), but maybe this could be solved via explicit clone when necessary for those rare cases that apply PassManager's to multiple top-level nodes. In any case, these cases should already be prepared to ensure proper state isolation, so additional clone call won't be crazy there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler-performance Topics on improving the performance of the compiler core. core Topics concerning the core segments of the compiler (frontend, midend, parser)
Projects
None yet
Development

No branches or pull requests

2 participants