Thank you for developing and sharing such an excellent project.
I am currently using Windows 11 + RTX 4090 GPU with the following environment:
- PyTorch 2.6.0
- CUDA 12.4
- xFormers 0.0.29.post3
- Triton 3.2.0
I ran the speed test with the default parameters in `script/profile_speed.py` (checkpoint: 23-36-37, valid_iters: 8).
However, the inference time is around 57 ms, which is slower than the 49.4 ms reported in your test results rather than matching or beating it.
Could you suggest any possible reasons for this performance gap?
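In case the measurement methodology matters, here is a rough sketch of how I understand GPU inference timing should be done, with warm-up runs and averaging (illustrative only; the workload is a placeholder, not code from this repository, and for CUDA models `torch.cuda.synchronize()` would be needed before each timestamp):

```python
import time

def benchmark(fn, warmup=3, iters=8):
    """Average wall-clock time of fn in ms over `iters` runs after `warmup` runs.

    Note: for GPU inference, call torch.cuda.synchronize() before reading
    each timestamp; otherwise the timer only captures kernel-launch overhead.
    """
    for _ in range(warmup):
        fn()  # warm-up runs: JIT/autotune/cache effects are excluded
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0  # ms per iteration

# Placeholder workload standing in for a model forward pass:
ms = benchmark(lambda: sum(i * i for i in range(10_000)))
```

If `profile_speed.py` already follows this pattern, then the gap is presumably elsewhere (e.g. Windows vs. Linux, driver version, or Triton/xFormers kernel differences).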
Thank you.