Change dynamic allocas to static allocas. #1492

Open
wants to merge 51 commits into main from eochoa/2025-02-03/static-alloca
Conversation

@erick-xanadu (Contributor) commented Feb 3, 2025

Context: Allocas grow the stack. If an alloca is placed anywhere other than the function's entry block, it may execute more than once. Executing it enough times can lead to a stack overflow.
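As an illustration (plain C++, not from this PR, assuming a glibc-style <alloca.h>), a stack allocation executed inside a loop keeps growing the current frame until the function returns, so a large enough trip count overflows the stack:

    // Hypothetical example: memory obtained with alloca() is only reclaimed
    // when the function returns, not at the end of each loop iteration.
    #include <alloca.h>
    #include <cstring>

    void repeated_alloca(long iterations)
    {
        for (long i = 0; i < iterations; ++i) {
            // A fresh 1 KiB slot is carved out of the stack on every iteration.
            char *buf = static_cast<char *>(alloca(1024));
            std::memset(buf, 0, 1024);
        }
        // All of the slots are released only here, at function exit.
    }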

Description of the Change:

A simple algorithm for moving static allocas to the beginning of the function's entry block.

Benefits: No stack overflows

Possible Drawbacks: None!

[sc-83313]
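A minimal sketch of what the hoisting helper could look like (the name getStaticAlloca appears later in this review; the includes, builder calls, and signatures here are assumptions and vary across MLIR versions):

    #include "mlir/Dialect/LLVMIR/LLVMDialect.h"
    #include "mlir/IR/PatternMatch.h"

    // Sketch only: build the alloca at the very top of the enclosing function's
    // entry block so it executes exactly once per call.
    mlir::LLVM::AllocaOp getStaticAlloca(mlir::Location loc, mlir::RewriterBase &rewriter,
                                         mlir::Type elemType, mlir::Value size)
    {
        // After structured control flow has been lowered, the op owning the
        // current insertion block is the enclosing LLVM function.
        mlir::Operation *funcOp = rewriter.getInsertionBlock()->getParentOp();
        mlir::Block &entryBlock = funcOp->getRegion(0).front();

        // Restore the caller's insertion point when the guard goes out of scope.
        mlir::OpBuilder::InsertionGuard guard(rewriter);
        rewriter.setInsertionPointToStart(&entryBlock);

        // Note: the size value must dominate this point; its defining op may
        // need to be moved to the entry block as well (handled separately).
        auto ptrType = mlir::LLVM::LLVMPointerType::get(rewriter.getContext());
        return rewriter.create<mlir::LLVM::AllocaOp>(loc, ptrType, elemType, size);
    }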

@erick-xanadu force-pushed the eochoa/2025-02-03/static-alloca branch from 7c58fce to 217a959 on February 4, 2025 14:25
@erick-xanadu marked this pull request as ready for review on February 5, 2025 19:38
github-actions bot (Contributor) commented Feb 5, 2025

Hello. You may have forgotten to update the changelog!
Please edit doc/releases/changelog-dev.md on your branch with:

  • A one-to-two sentence description of the change. You may include a small working example for new features.
  • A link back to this PR.
  • Your name (or GitHub username) in the contributors section.

@erick-xanadu changed the title Change dynamic allocas to static allocas. → 🚧 Change dynamic allocas to static allocas. on Feb 5, 2025
@erick-xanadu changed the title 🚧 Change dynamic allocas to static allocas. → Change dynamic allocas to static allocas. on Feb 5, 2025
@erick-xanadu requested a review from a team on February 5, 2025 20:37
@mehrdad2m (Contributor) commented Feb 5, 2025

Have you done any measurements on how this affects the timing and memory usage of benchmarks? I think this solution is a bit aggressive, but it probably works for now. One potential problem I see is that we are basically extending the lifetime of all the allocas (I think that is why you call them static allocas, right?). This may cause register pressure in the future when we scale to larger circuits and consequently can affect performance. I doubt it has any effect now, but that is why I asked about the benchmarks.

@erick-xanadu (Contributor, Author) commented Feb 5, 2025

Have you done any measurements on how this affects the timing and memory usage of benchmarks? I think this solution is a bit aggressive, but it probably works for now. One potential problem I see is that we are basically extending the lifetime of all the allocas (I think that is why you call them static allocas, right?). This may cause register pressure in the future when we scale to larger circuits and consequently can affect performance. I doubt it has any effect now, but that is why I asked about the benchmarks.

I have not benchmarked it.

I don't believe this solution is aggressive. Static allocas are a thing already in LLVM.

From Performance Tips for Frontend Authors

An alloca instruction can be used to represent a function scoped stack slot, but can also represent dynamic frame expansion. When representing function scoped variables or locations, placing alloca instructions at the beginning of the entry block should be preferred. In particular, place them before any call instructions. Call instructions might get inlined and replaced with multiple basic blocks. The end result is that a following alloca instruction would no longer be in the entry basic block afterward.

The SROA (Scalar Replacement Of Aggregates) and Mem2Reg passes only attempt to eliminate alloca instructions that are in the entry basic block. Given SSA is the canonical form expected by much of the optimizer; if allocas can not be eliminated by Mem2Reg or SROA, the optimizer is likely to be less effective than it could be.

I'll still benchmark it though!

Regarding memory usage: this won't show up in heap memory since it is all about the stack. And now that we reuse stack slots, stack memory use decreases.

This may cause register pressure

I imagine that liveness analysis should know and determine when to spill, but we are also not using the best register allocation. It has been noted that, due to having large functions, the compilation time of register allocation was a bottleneck. A solution to this is to split up functions and avoid inlining. But we've always had the invariant that quantum operations are only allowed in the qnode function. We need to develop the abstraction of a quantum function that is not the entry point, and extend the gradient analysis pass to support this feature. The first item is easy; about the second one I am not sure.

mlir/lib/Catalyst/Utils/StaticAllocas.cpp — 3 review threads (outdated, resolved)
@mehrdad2m (Contributor)

I imagine that liveness analysis should know and determine when to spill, but we are also not using the best register allocation. It has been noted that, due to having large functions, the compilation time of register allocation was a bottleneck. A solution to this is to split up functions and avoid inlining. But we've always had the invariant that quantum operations are only allowed in the qnode function. We need to develop the abstraction of a quantum function that is not the entry point, and extend the gradient analysis pass to support this feature. The first item is easy; about the second one I am not sure.

Thanks for the info! I definitely agree that having static allocas helps with stack memory reuse. My only concern was the register spillage. I am not an expert on register allocation, so I am not sure if it can spill optimally in both cases, but I assumed that increasing the lifetime of allocas would result in more spillage, which causes performance degradation.

@erick-xanadu (Contributor, Author) commented Feb 5, 2025

Thanks for the info! I definitely agree that having static allocas helps with stack memory reuse. My only concern was the register spillage. I am not an expert on register allocation, so I am not sure if it can spill optimally in both cases, but I assumed that increasing the lifetime of allocas would result in more spillage, which causes performance degradation.

Additionally, I don't think static stack allocation affects register spilling(?). If I understand correctly, stack allocations are just a constant offset from the stack pointer register. And you never spill the stack pointer register.

@erick-xanadu (Contributor, Author)

@jay-selby, I've addressed your comments. Thanks for the review!

@erick-xanadu (Contributor, Author)

The [memref.]alloca operation allocates memory on the stack, to be automatically released when control transfers back from the region of its closest surrounding operation with an AutomaticAllocationScope trait. The amount of memory allocated is specified by its memref and additional operands.

From here

I'll look into this a little more, but if this is the case, then at least for memref.alloca, moving them to the entry block may not be better. This only affects gradients.

@jay-selby (Contributor)

Thanks for the info! I definitely agree that having static allocas helps with stack memory reuse. My only concern was the register spillage. I am not an expert on register allocation, so I am not sure if it can spill optimally in both cases, but I assumed that increasing the lifetime of allocas would result in more spillage, which causes performance degradation.

Additionally, I don't think static stack allocation affects register spilling(?). If I understand correctly, stack allocations are just a constant offset from the stack pointer register. And you never spill the stack pointer register.

You do, but it is typically wrapped up into the calling convention. For example, on function entry, save all registers with something like:

  # save all regs (typically includes sp)
  pusha
  # alloc stack space for locals
  sub sp, sp, 128
  ...
  # reload saved regs
  popa
  ret

@erick-xanadu force-pushed the eochoa/2025-02-03/static-alloca branch from 080de3e to 16ca9c0 on February 6, 2025 19:00
@erick-xanadu requested review from jay-selby and a team on February 6, 2025 19:00
codecov bot commented Feb 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.73%. Comparing base (910ce27) to head (8810422).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1492   +/-   ##
=======================================
  Coverage   96.73%   96.73%           
=======================================
  Files          76       76           
  Lines        8219     8219           
  Branches      779      779           
=======================================
  Hits         7951     7951           
  Misses        213      213           
  Partials       55       55           


@dime10 (Contributor) left a comment


Nice work Erick, this is really helpful!

mlir/include/Catalyst/Utils/StaticAllocas.h — review thread (outdated, resolved)
Comment on lines +31 to +33
// Move the value's defining op to the beginning of the entry block
Operation *value_def = value.getDefiningOp();
rewriter.moveOpBefore(value_def, &entryBlock->front());
Contributor

The value you are moving is the size of the allocation (which is always a constant)? In that case it would be safer for your helper function to accept an int and instantiate the size constant here. This enforces your constraint rather than assuming it.
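For concreteness, a sketch of the suggested alternative (the signature and builder calls are assumptions, not code from this PR): take the size as a plain integer and materialize the constant inside the helper, right next to the alloca in the entry block.

    // Hypothetical variant of the helper: the caller passes an integer, so the
    // helper can guarantee the size is a constant. Same includes as the
    // entry-block sketch in the PR description.
    mlir::LLVM::AllocaOp getStaticAlloca(mlir::Location loc, mlir::RewriterBase &rewriter,
                                         mlir::Type elemType, int64_t size)
    {
        mlir::Operation *funcOp = rewriter.getInsertionBlock()->getParentOp();
        mlir::Block &entryBlock = funcOp->getRegion(0).front();

        mlir::OpBuilder::InsertionGuard guard(rewriter);
        rewriter.setInsertionPointToStart(&entryBlock);

        // The size constant is created here, so it always dominates the alloca.
        auto sizeVal = rewriter.create<mlir::LLVM::ConstantOp>(
            loc, rewriter.getI64Type(), rewriter.getI64IntegerAttr(size));
        auto ptrType = mlir::LLVM::LLVMPointerType::get(rewriter.getContext());
        return rewriter.create<mlir::LLVM::AllocaOp>(loc, ptrType, elemType, sizeVal);
    }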

Contributor Author

That was my original design, but it is not a good one.

  1. It forces the creation of a new constant for each alloca, when in the code we sometimes have only one constant and reuse it across multiple allocas.
  2. It prevents the use of a GEP result, which is also a constant, but one that is harder for us to compute than it is for the GEP instruction.

Contributor

It forces the creation of a new constant for each alloca, when in the code we sometimes have only one constant and reuse it across multiple allocas.

I think this is not too bad; the canonicalizer will fuse duplicate constants anyway.

It prevents the use of a GEP result, which is also a constant, but one that is harder for us to compute than it is for the GEP instruction.

I didn't see GEP used anywhere in your PR as an argument to your function; the only related lines were these (GEP appears in the code context), but the size argument is still from a constant instruction 🤔 Can you explain where/how you use a GEP as a size value?

    auto numControlledVal = rewriter.create<LLVM::ConstantOp>(loc, rewriter.getI64IntegerAttr(controlledQubits.size())).getResult();
    ...
        ctrlPtr = catalyst::getStaticAlloca(loc, rewriter, ptrType, numControlledVal).getResult();
        valuePtr = catalyst::getStaticAlloca(loc, rewriter, boolType, numControlledVal).getResult();

Contributor Author

You are right. I had GEP values used in a previous version of the code, where we also moved the memref::AllocaOp to the top of the function, but reading the semantics, moving memref::AllocaOp did not make sense because the allocated memory is actually released when its enclosing scope exits. (That is at least according to the documentation.)

The alloca operation allocates memory on the stack, to be automatically released when control transfers back from the region of its closest surrounding operation with an AutomaticAllocationScope trait.

Contributor

Oh right, so you think that if we stack-allocate a memref in a loop it would be deallocated each iteration anyway? I wonder if this is enforced in the standard lowering.
I can see that both func.func and scf.for carry the AutomaticAllocationScope trait, so that should cover most cases, but it is absent from scf.while 🤔
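For reference, one way to check this programmatically (a minimal sketch against the MLIR C++ API; not part of this PR):

    #include "mlir/IR/OpDefinition.h"

    // Ops carrying the AutomaticAllocationScope trait free any memref.alloca
    // created in their region when control leaves that region.
    bool freesAllocasOnExit(mlir::Operation *op)
    {
        // func.func and scf.for carry the trait; scf.while apparently does not.
        return op->hasTrait<mlir::OpTrait::AutomaticAllocationScope>();
    }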

Contributor Author

🤔 I did look and found it on scf.for, but I forgot to check scf.while; in that case, I will revert this change.

Contributor Author

The GEP may have been what you pointed out, so maybe they are indeed all constants. Ok, I'll re-add the static memref::AllocaOp where it is needed and change it to a constant.
