Change dynamic allocas to static allocas. #1492
base: main
Conversation
Hello. You may have forgotten to update the changelog!
Have you done any measurements on how this affects the timing and memory usage of benchmarks? I think this solution is a bit aggressive, but it probably works for now. One potential problem I see is that we are basically extending the lifetime of all the allocas (I think that is why you call them static allocas, right?). This may cause register pressure in the future when we scale to larger circuits, and consequently it can affect performance. I doubt it has any effect now, but that is why I asked about the benchmarks.
I have not benchmarked it. I don't believe this solution is aggressive; static allocas are already an established notion in LLVM. From Performance Tips for Frontend Authors
I'll still benchmark it though! Regarding memory usage: this won't show up in heap memory, since it is all about the stack. And now we are reusing stack slots, so stack memory use decreases.
I imagine that liveness analysis should know and determine when to spill, but we are also not using the best register allocator. It has been noted that, due to having large functions, the compilation time of register allocation was a bottleneck. A solution to this is to split up functions and avoid inlining. But we've always had the invariant that quantum operations are only allowed in the qnode function. We would need to develop the abstraction of a quantum function that is not the entry point, and extend the gradient analysis pass to support this feature. The first item is easy; about the second one I am not sure.
Thanks for the info! I definitely agree that having static allocas helps with stack memory reuse. My only concern was the register spillage. I am not an expert on register allocation, so I am not sure if it can spill optimally in both cases, but I assumed that increasing the lifetime of allocas would result in more spillage, and that causes performance degradation.
Additionally, I don't think static stack allocation affects register spilling (?). If I understand correctly, stack allocations are just a constant offset from the stack pointer register, and you never spill the stack pointer register.
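A rough x86-64 illustration of that point (hypothetical codegen, not taken from this PR): two fixed-size entry-block allocas collapse into a single constant stack-pointer adjustment, and each buffer is addressed at a constant offset from the stack pointer, so no extra register is consumed:

```asm
foo:
    sub   rsp, 128          ; one constant adjustment reserves both buffers
    lea   rax, [rsp]        ; first  64-byte buffer at rsp + 0
    lea   rcx, [rsp + 64]   ; second 64-byte buffer at rsp + 64
    ; ... function body ...
    add   rsp, 128          ; one constant adjustment releases everything
    ret
```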
@jay-selby , I've addressed your comments. Thanks for the review!
I'll look a little bit more into this, but if this is the case, at least for memref.alloca, moving them to the entry block may not be better. This just affects gradients.
mlir/lib/Gradient/Transforms/GradMethods/PS_QuantumGradient.cpp
You do, but it is typically wrapped up into the calling convention. For example, on function entry you save all the callee-saved registers with something like:
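For illustration (a hypothetical x86-64 prologue/epilogue sketch, not the original snippet from this thread): the prologue spills the callee-saved registers so the body can use them freely, and the epilogue restores them:

```asm
callee:
    push  rbx               ; save callee-saved registers on entry
    push  r12
    push  r13
    ; ... body may freely clobber rbx, r12, r13 ...
    pop   r13               ; restore them before returning
    pop   r12
    pop   rbx
    ret
```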
Codecov Report: All modified and coverable lines are covered by tests ✅

@@           Coverage Diff           @@
##             main    #1492   +/-   ##
=======================================
  Coverage   96.73%   96.73%
=======================================
  Files          76       76
  Lines        8219     8219
  Branches      779      779
=======================================
  Hits         7951     7951
  Misses        213      213
  Partials       55       55
Nice work Erick, this is really helpful!
// Move the value to the beginning
Operation *value_def = value.getDefiningOp();
rewriter.moveOpBefore(value_def, &entryBlock->front());
The value you are moving is the size of the allocation (which is always a constant)? In that case it would be safer for your helper function to accept an int and instantiate the size constant here. This enforces your constraint rather than assuming it.
That was my original design, but it is not a good one.
- It forces the creation of a new constant for each alloca, when in the code we sometimes have only one constant and reuse it across multiple allocas.
- It prevents the use of GEP, which is also a constant, but one that is harder for us to compute than it is for the GEP instruction.
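For concreteness, the suggested alternative would have looked roughly like this (a hypothetical pseudocode-level sketch against MLIR's OpBuilder API, not code from this PR; `getStaticAlloca` is the PR's helper, the `int64_t` parameter is the reviewer's proposal):

```cpp
// Hypothetical: the helper takes a plain integer and materializes the size
// constant itself, so callers cannot hand it a non-constant SSA value.
LLVM::AllocaOp getStaticAlloca(Location loc, OpBuilder &builder,
                               Type elemType, int64_t numElements)
{
    Value size = builder.create<LLVM::ConstantOp>(
        loc, builder.getI64IntegerAttr(numElements));
    // ... create the alloca in the entry block using `size` ...
}
```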
> It enforces the creation of a new constant for each alloca when in the code we sometimes only have one and reuse it across multiple allocas

I think this is not too bad; the canonicalizer will fuse duplicate constants anyway.

> It prevents the use of GEP, which is also a constant, but is a constant that is harder for us to compute than it is for the GEP instruction.

I didn't see GEP used anywhere in your PR as an argument to your function; the only related lines were these (GEP appears in the code context), but the size argument is still from a constant instruction 🤔 Can you explain where/how you use GEP as a size value?
auto numControlledVal = rewriter.create<LLVM::ConstantOp>(
    loc, rewriter.getI64IntegerAttr(controlledQubits.size())).getResult();
...
ctrlPtr = catalyst::getStaticAlloca(loc, rewriter, ptrType, numControlledVal).getResult();
valuePtr = catalyst::getStaticAlloca(loc, rewriter, boolType, numControlledVal).getResult();
You are right. I had GEP values in a previous version of the code, where we also moved the memref::AllocaOp to the top of the function; but reading the semantics, moving memref::AllocaOp did not make sense, because the memory created is actually released upon leaving the scope (at least according to the documentation):

> The alloca operation allocates memory on the stack, to be automatically released when control transfers back from the region of its closest surrounding operation with an AutomaticAllocationScope trait.
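A small MLIR sketch of what that quote implies (illustrative; op names as in upstream MLIR, `%lb`, `%ub`, `%step` assumed to be index values):

```mlir
func.func @f(%lb: index, %ub: index, %step: index) {
  // func.func carries AutomaticAllocationScope, and so does scf.for:
  scf.for %i = %lb to %ub step %step {
    %buf = memref.alloca() : memref<4xf64>
    // %buf is released when this iteration's region exits, so hoisting it
    // to the function entry block would extend its lifetime.
  }
  return
}
```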
Oh right, so you think if we stack-allocate a memref in a loop it would be deallocated each iteration anyway? I wonder if this is enforced in the standard lowering. I can see that both func.func and scf.for carry the AutomaticAllocationScope trait, so that should cover most cases, but it is absent from scf.while 🤔
🤔 I did look and found it in scf.for, but I forgot to look for scf.while; in that case, I will revert this change.
The GEP may have been what you pointed out. So maybe they are indeed all constants. Ok, I'll re-add the static memref::AllocaOp where it is needed and change its size to a constant.
Context: Allocas grow the stack. If an alloca is placed anywhere other than the function's entry block, it may execute more than once; executing it enough times can lead to a stack overflow.
Description of the Change:
A simple algorithm that moves static allocas to the beginning of the function's entry basic block.
Benefits: No stack overflows
Possible Drawbacks: None!
[sc-83313]
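The algorithm described above can be sketched as follows (a hypothetical pseudocode-level sketch against the MLIR rewriter API; names and structure are illustrative, not the PR's exact code):

```cpp
// Hypothetical sketch: hoist constant-sized LLVM allocas to the entry block.
void hoistStaticAllocas(LLVM::LLVMFuncOp func, RewriterBase &rewriter)
{
    Block *entry = &func.getBody().front();
    func.walk([&](LLVM::AllocaOp alloca) {
        if (alloca->getBlock() == entry)
            return; // already static
        Value size = alloca.getArraySize();
        // Only allocas whose size is a compile-time constant are "static".
        if (!size.getDefiningOp<LLVM::ConstantOp>())
            return;
        // Move the alloca to the top of the entry block, then place its
        // size constant immediately before it so dominance still holds.
        rewriter.moveOpBefore(alloca, &entry->front());
        rewriter.moveOpBefore(size.getDefiningOp(), alloca);
    });
}
```

Moving the size constant after the alloca (but placing it before it) keeps the constant dominating both the alloca and any other users, since the entry block's start dominates the whole function.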