Above all, the --baseline option already lets users choose any backend as the baseline when running the benchmark.
For the default baseline impl, I think we should prioritize coverage, so that the default is broadly available.
For example, if we use cutlass (like flash_v3), cudnn, or cublas as the baseline for Triton, the baseline won't be available on AMD.
Maybe we should always use the default torch/aten operator as the baseline when it is available. Note, however, that the torch/aten impl might be very slow and might OOM on large inputs.
Or maybe we can avoid having a default baseline backend in the code at all, and always require the user to specify one when they want relative metrics like speedup or memory_compression_ratio.
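To make the last option concrete, here is a minimal sketch of what "no default baseline" could look like. The function name `speedup` and the latency dictionary are hypothetical and not taken from the benchmark code; the only assumption is that speedup is the baseline latency divided by the backend latency.

```python
from typing import Dict, Optional


def speedup(latencies_ms: Dict[str, float], baseline: Optional[str]) -> Dict[str, float]:
    """Return per-backend speedup relative to an explicitly named baseline.

    Hypothetical helper: refuses to compute relative metrics unless the
    caller names a baseline that was actually measured.
    """
    if baseline is None:
        raise ValueError("relative metrics need an explicit --baseline backend")
    if baseline not in latencies_ms:
        raise ValueError(f"baseline {baseline!r} was not measured")
    base = latencies_ms[baseline]
    # Speedup > 1.0 means the backend is faster than the baseline.
    return {name: base / ms for name, ms in latencies_ms.items() if name != baseline}
```

With this shape, a run without a baseline still reports absolute metrics, and the error only fires when someone asks for speedup.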
The baselines do not all seem to represent the best-in-class version of the kernels. Let's audit them and see where we can improve this.
Example: when flash_v3 is available, that really should be the baseline IMO, but when it isn't we can default to sdpa as we do now.
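The fallback described in the example could be sketched roughly as follows. `pick_baseline` and the preference list are hypothetical names for illustration; how the benchmark actually detects available backends is not shown here.

```python
from typing import Iterable, Optional, Sequence


def pick_baseline(preferred: Sequence[str], available: Iterable[str]) -> Optional[str]:
    """Return the first preferred backend that is available, else None.

    Hypothetical helper: `preferred` is ordered best-in-class first,
    `available` is whatever backends can run on the current hardware.
    """
    available_set = set(available)
    for name in preferred:
        if name in available_set:
            return name
    return None


# Prefer flash_v3, fall back to sdpa when flash_v3 cannot run.
baseline = pick_baseline(["flash_v3", "sdpa"], ["sdpa", "triton"])
```

Keeping the preference order as data makes the audit easier: each operator only needs its list reviewed, not its selection logic.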