Do your results include an ablation over the torch operator implementations? I've noticed that transformer models run significantly faster on torch 2.0 than on torch 1.x.
The results in the paper were obtained with torch 1.x and without FlashAttention (unless explicitly mentioned). The released code includes plenty of performance optimizations and is significantly faster than what is reported in the paper.
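For anyone wanting to quantify the torch 2.0 operator effect in isolation, here is a minimal benchmark sketch (not part of the paper's ablations): it compares an eager-mode attention, as it would run under torch 1.x, against torch 2.0's fused `scaled_dot_product_attention`, which dispatches to FlashAttention-style kernels when available. The tensor shapes and iteration counts are illustrative, and it assumes a CUDA GPU.

```python
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Eager-mode attention: materializes the full (seq x seq) score matrix,
    # as a typical torch 1.x implementation would
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale
    return attn.softmax(dim=-1) @ v

def bench(fn, *args, iters=50):
    # Warm up, then time with CUDA synchronization for accurate numbers
    for _ in range(5):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Illustrative shapes: batch 8, 16 heads, sequence 1024, head dim 64
q, k, v = (torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

t_naive = bench(naive_attention, q, k, v)
# Fused attention kernel, new in torch 2.0
t_sdpa = bench(F.scaled_dot_product_attention, q, k, v)
print(f"naive: {t_naive * 1e3:.2f} ms   sdpa: {t_sdpa * 1e3:.2f} ms")
```

The gap generally grows with sequence length, since the fused kernel avoids materializing the full attention matrix.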