Change dynamic allocas to static allocas. #1492

Open
wants to merge 51 commits into main from eochoa/2025-02-03/static-alloca
Conversation

@erick-xanadu (Contributor) commented Feb 3, 2025

Context: Allocas grow the stack. If an alloca is placed anywhere other than the function's entry block, it may execute more than once. Executing it enough times can lead to a stack overflow.
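As an illustration (plain C++, not from this PR, assuming a glibc-style <alloca.h>), a stack allocation executed inside a loop keeps growing the current frame until the function returns, so a large enough trip count overflows the stack:

    // Hypothetical example: memory obtained with alloca() is only reclaimed
    // when the function returns, not at the end of each loop iteration.
    #include <alloca.h>
    #include <cstring>

    void repeated_alloca(long iterations)
    {
        for (long i = 0; i < iterations; ++i) {
            // A fresh 1 KiB slot is carved out of the stack on every iteration.
            char *buf = static_cast<char *>(alloca(1024));
            std::memset(buf, 0, 1024);
        }
        // All of the slots are released only here, at function exit.
    }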

Description of the Change:

A simple algorithm for moving static allocas to the beginning of the function's entry block.

Benefits: No stack overflows

Possible Drawbacks: None!

[sc-83313]
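A minimal sketch of what the hoisting helper could look like (the name getStaticAlloca appears later in this review; the includes, builder calls, and signatures here are assumptions and vary across MLIR versions):

    #include "mlir/Dialect/LLVMIR/LLVMDialect.h"
    #include "mlir/IR/PatternMatch.h"

    // Sketch only: build the alloca at the very top of the enclosing function's
    // entry block so it executes exactly once per call.
    mlir::LLVM::AllocaOp getStaticAlloca(mlir::Location loc, mlir::RewriterBase &rewriter,
                                         mlir::Type elemType, mlir::Value size)
    {
        // After structured control flow has been lowered, the op owning the
        // current insertion block is the enclosing LLVM function.
        mlir::Operation *funcOp = rewriter.getInsertionBlock()->getParentOp();
        mlir::Block &entryBlock = funcOp->getRegion(0).front();

        // Restore the caller's insertion point when the guard goes out of scope.
        mlir::OpBuilder::InsertionGuard guard(rewriter);
        rewriter.setInsertionPointToStart(&entryBlock);

        // Note: the size value must dominate this point; its defining op may
        // need to be moved to the entry block as well (handled separately).
        auto ptrType = mlir::LLVM::LLVMPointerType::get(rewriter.getContext());
        return rewriter.create<mlir::LLVM::AllocaOp>(loc, ptrType, elemType, size);
    }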

@erick-xanadu force-pushed the eochoa/2025-02-03/static-alloca branch from 7c58fce to 217a959 on February 4, 2025 14:25
@erick-xanadu marked this pull request as ready for review on February 5, 2025 19:38
github-actions bot (Contributor) commented Feb 5, 2025

Hello. You may have forgotten to update the changelog!
Please edit doc/releases/changelog-dev.md on your branch with:

  • A one-to-two sentence description of the change. You may include a small working example for new features.
  • A link back to this PR.
  • Your name (or GitHub username) in the contributors section.

@erick-xanadu changed the title Change dynamic allocas to static allocas. → 🚧 Change dynamic allocas to static allocas. on Feb 5, 2025
@erick-xanadu changed the title 🚧 Change dynamic allocas to static allocas. → Change dynamic allocas to static allocas. on Feb 5, 2025
@erick-xanadu requested a review from a team on February 5, 2025 20:37
@mehrdad2m (Contributor) commented Feb 5, 2025

Have you done any measurements on how this affects the timing and memory usage of benchmarks? I think this solution is a bit aggressive, but it probably works for now. One potential problem I see is that we are basically extending the lifetime of all the allocas (I think that is why you call them static allocas, right?). This may cause register pressure in the future when we scale to larger circuits and consequently can affect performance. I doubt it has any effect now, but that is why I asked about the benchmarks.

@erick-xanadu (Contributor, Author) commented Feb 5, 2025

Have you done any measurements on how this affects the timing and memory usage of benchmarks? I think this solution is a bit aggressive, but it probably works for now. One potential problem I see is that we are basically extending the lifetime of all the allocas (I think that is why you call them static allocas, right?). This may cause register pressure in the future when we scale to larger circuits and consequently can affect performance. I doubt it has any effect now, but that is why I asked about the benchmarks.

I have not benchmarked it.

I don't believe this solution is aggressive. Static allocas are a thing already in LLVM.

From Performance Tips for Frontend Authors

An alloca instruction can be used to represent a function scoped stack slot, but can also represent dynamic frame expansion. When representing function scoped variables or locations, placing alloca instructions at the beginning of the entry block should be preferred. In particular, place them before any call instructions. Call instructions might get inlined and replaced with multiple basic blocks. The end result is that a following alloca instruction would no longer be in the entry basic block afterward.

The SROA (Scalar Replacement Of Aggregates) and Mem2Reg passes only attempt to eliminate alloca instructions that are in the entry basic block. Given SSA is the canonical form expected by much of the optimizer; if allocas can not be eliminated by Mem2Reg or SROA, the optimizer is likely to be less effective than it could be.

I'll still benchmark it though!

Regarding memory usage: this won't show up in heap memory since it is all about the stack. And now that we reuse stack slots, stack memory use decreases.

This may cause register pressure

I imagine that liveness analysis should know and determine when to spill, but we are also not using the best register allocation. It has been noted that, due to having large functions, the compilation time of register allocation was a bottleneck. A solution to this is to split up functions and avoid inlining. But we've always had the invariant that quantum operations are only allowed in the qnode function. We need to develop the abstraction of a quantum function that is not the entry point, and extend the gradient analysis pass to support this feature. The first item is easy; about the second one I am not sure.

mlir/lib/Catalyst/Utils/StaticAllocas.cpp — 3 review threads (outdated, resolved)
@mehrdad2m (Contributor)

I imagine that liveness analysis should know and determine when to spill, but we are also not using the best register allocation. It has been noted that, due to having large functions, the compilation time of register allocation was a bottleneck. A solution to this is to split up functions and avoid inlining. But we've always had the invariant that quantum operations are only allowed in the qnode function. We need to develop the abstraction of a quantum function that is not the entry point, and extend the gradient analysis pass to support this feature. The first item is easy; about the second one I am not sure.

Thanks for the info! I definitely agree that having static allocas helps with stack memory reuse. My only concern was the register spillage. I am not an expert on register allocation, so I am not sure if it can spill optimally in both cases, but I assumed that increasing the lifetime of allocas would result in more spillage, which causes performance degradation.

@erick-xanadu (Contributor, Author) commented Feb 5, 2025

Thanks for the info! I definitely agree that having static allocas helps with stack memory reuse. My only concern was the register spillage. I am not an expert on register allocation, so I am not sure if it can spill optimally in both cases, but I assumed that increasing the lifetime of allocas would result in more spillage, which causes performance degradation.

Additionally, I don't think static stack allocation affects register spilling(?). If I understand correctly, stack allocations are just a constant offset from the stack pointer register. And you never spill the stack pointer register.

@erick-xanadu (Contributor, Author)

@jay-selby, I've addressed your comments. Thanks for the review!

@erick-xanadu (Contributor, Author)

The [memref.]alloca operation allocates memory on the stack, to be automatically released when control transfers back from the region of its closest surrounding operation with an AutomaticAllocationScope trait. The amount of memory allocated is specified by its memref and additional operands.

From here

I'll look into this a little more, but if this is the case, then at least for memref.alloca, moving them to the entry block may not be better. This only affects gradients.

@jay-selby (Contributor)

Thanks for the info! I definitely agree that having static allocas helps with stack memory reuse. My only concern was the register spillage. I am not an expert on register allocation, so I am not sure if it can spill optimally in both cases, but I assumed that increasing the lifetime of allocas would result in more spillage, which causes performance degradation.

Additionally, I don't think static stack allocation affects register spilling(?). If I understand correctly, stack allocations are just a constant offset from the stack pointer register. And you never spill the stack pointer register.

You do, but it is typically wrapped up into the calling convention. For example, on function entry, save all registers with something like:

  # save all regs (typically includes sp)
  pusha
  # alloc stack space for locals
  sub sp, sp, 128
  ...
  # reload saved regs
  popa
  ret

@erick-xanadu force-pushed the eochoa/2025-02-03/static-alloca branch from 080de3e to 16ca9c0 on February 6, 2025 19:00
@erick-xanadu requested review from jay-selby and a team on February 6, 2025 19:00
codecov bot commented Feb 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.73%. Comparing base (910ce27) to head (8810422).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1492   +/-   ##
=======================================
  Coverage   96.73%   96.73%           
=======================================
  Files          76       76           
  Lines        8219     8219           
  Branches      779      779           
=======================================
  Hits         7951     7951           
  Misses        213      213           
  Partials       55       55           


@dime10 (Contributor) left a comment


Nice work Erick, this is really helpful!

mlir/include/Catalyst/Utils/StaticAllocas.h — review thread (outdated, resolved)
Comment on lines +31 to +33
// Move the value's defining op to the beginning of the entry block
Operation *value_def = value.getDefiningOp();
rewriter.moveOpBefore(value_def, &entryBlock->front());
Contributor

The value you are moving is the size of the allocation (which is always a constant)? In that case it would be safer for your helper function to accept an int and instantiate the size constant here. This enforces your constraint rather than assuming it.
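For concreteness, a sketch of the suggested alternative (the signature and builder calls are assumptions, not code from this PR): take the size as a plain integer and materialize the constant inside the helper, right next to the alloca in the entry block.

    // Hypothetical variant of the helper: the caller passes an integer, so the
    // helper can guarantee the size is a constant. Same includes as the
    // entry-block sketch in the PR description.
    mlir::LLVM::AllocaOp getStaticAlloca(mlir::Location loc, mlir::RewriterBase &rewriter,
                                         mlir::Type elemType, int64_t size)
    {
        mlir::Operation *funcOp = rewriter.getInsertionBlock()->getParentOp();
        mlir::Block &entryBlock = funcOp->getRegion(0).front();

        mlir::OpBuilder::InsertionGuard guard(rewriter);
        rewriter.setInsertionPointToStart(&entryBlock);

        // The size constant is created here, so it always dominates the alloca.
        auto sizeVal = rewriter.create<mlir::LLVM::ConstantOp>(
            loc, rewriter.getI64Type(), rewriter.getI64IntegerAttr(size));
        auto ptrType = mlir::LLVM::LLVMPointerType::get(rewriter.getContext());
        return rewriter.create<mlir::LLVM::AllocaOp>(loc, ptrType, elemType, sizeVal);
    }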

Contributor Author

That was my original design, but it is not a good one.

  1. It forces the creation of a new constant for each alloca, when in the code we sometimes have only one constant and reuse it across multiple allocas.
  2. It prevents the use of a GEP result, which is also a constant, but one that is harder for us to compute than it is for the GEP instruction.

Contributor

It forces the creation of a new constant for each alloca, when in the code we sometimes have only one constant and reuse it across multiple allocas.

I think this is not too bad; the canonicalizer will fuse duplicate constants anyway.

It prevents the use of a GEP result, which is also a constant, but one that is harder for us to compute than it is for the GEP instruction.

I didn't see GEP used anywhere in your PR as an argument to your function; the only related lines were these (GEP appears in the code context), but the size argument is still from a constant instruction 🤔 Can you explain where/how you use a GEP as a size value?

    auto numControlledVal = rewriter.create<LLVM::ConstantOp>(loc, rewriter.getI64IntegerAttr(controlledQubits.size())).getResult();
    ...
        ctrlPtr = catalyst::getStaticAlloca(loc, rewriter, ptrType, numControlledVal).getResult();
        valuePtr = catalyst::getStaticAlloca(loc, rewriter, boolType, numControlledVal).getResult();

Contributor Author

You are right. I had GEP values used in a previous version of the code, where we also moved the memref::AllocaOp to the top of the function, but reading the semantics, moving memref::AllocaOp did not make sense because the allocated memory is actually released when its enclosing scope exits. (That is at least according to the documentation.)

The alloca operation allocates memory on the stack, to be automatically released when control transfers back from the region of its closest surrounding operation with an AutomaticAllocationScope trait.

Contributor

Oh right, so you think that if we stack-allocate a memref in a loop it would be deallocated each iteration anyway? I wonder if this is enforced in the standard lowering.
I can see that both func.func and scf.for carry the AutomaticAllocationScope trait, so that should cover most cases, but it is absent from scf.while 🤔
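For reference, one way to check this programmatically (a minimal sketch against the MLIR C++ API; not part of this PR):

    #include "mlir/IR/OpDefinition.h"

    // Ops carrying the AutomaticAllocationScope trait free any memref.alloca
    // created in their region when control leaves that region.
    bool freesAllocasOnExit(mlir::Operation *op)
    {
        // func.func and scf.for carry the trait; scf.while apparently does not.
        return op->hasTrait<mlir::OpTrait::AutomaticAllocationScope>();
    }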

Contributor Author

🤔 I did look and found it on scf.for, but I forgot to check scf.while; in that case, I will revert this change.

Contributor Author

The GEP may have been what you pointed out, so maybe they are indeed all constants. Ok, I'll re-add the static memref::AllocaOp where it is needed and change it to a constant.
