Incorporation with llama 2? #8
Comments
I believe there is no problem with using the Flash-LLM kernel on llama 2. Flash-LLM mainly consists of a high-performance SpMM GPU kernel, which should be efficient for all existing LLM inference MatMul shapes.
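For reference, a rough sketch of the MatMul shapes such a kernel would need to cover for Llama-2-7B decoding, assuming the standard configuration (hidden size 4096, intermediate size 11008); these shapes are an illustration of the workload, not something taken from the Flash-LLM repo:

```python
# Illustrative only: per-layer weight shapes for Llama-2-7B (assumed standard config).
# During token-by-token decoding, each MatMul multiplies an N x K weight by a K x M
# activation, where M (the token batch) is small -- the skinny-MatMul regime where
# a sparse weight kernel is expected to pay off.
HIDDEN, INTERMEDIATE = 4096, 11008

LLAMA2_7B_MATMULS = {
    "q_proj":    (HIDDEN, HIDDEN),        # attention query projection
    "k_proj":    (HIDDEN, HIDDEN),        # attention key projection
    "v_proj":    (HIDDEN, HIDDEN),        # attention value projection
    "o_proj":    (HIDDEN, HIDDEN),        # attention output projection
    "gate_proj": (INTERMEDIATE, HIDDEN),  # MLP gate projection
    "up_proj":   (INTERMEDIATE, HIDDEN),  # MLP up projection
    "down_proj": (HIDDEN, INTERMEDIATE),  # MLP down projection
}

for name, (n, k) in LLAMA2_7B_MATMULS.items():
    print(f"{name:10s}: weight {n} x {k}")
```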
Does llama 2 have such high unstructured sparsity? And can our method be combined with quantization?
We do have ongoing research that achieves 70% unstructured sparsity on llama 2 with negligible accuracy loss, which is why we want to see the speed gain from removing those weights.
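As a minimal illustration of what 70% unstructured (element-wise) sparsity means for a single weight matrix, the sketch below uses simple magnitude pruning; this is only a stand-in for whatever pruning method that research actually uses:

```python
# Minimal sketch: prune a weight matrix to ~70% unstructured sparsity by magnitude.
# This is NOT the pruning method referenced above, just an illustration of the target.
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.70) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so `sparsity` of them become zero."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(4096, 4096)          # e.g. one attention projection of Llama-2-7B
w_sparse = magnitude_prune(w, 0.70)
print(f"actual sparsity: {(w_sparse == 0).float().mean():.2%}")  # roughly 70%
```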
That's amazing! To my knowledge, the SparseGPT and Wanda methods significantly increase perplexity when llama is pruned to 70% sparsity.
That is amazing! For 70% unstructured sparsity, I believe even better performance can be achieved compared to our existing implementation. I am currently working on another project, but we can help you further optimize support for 70% unstructured sparsity later.
Is it possible to use this with llama 2? I'm interested in improving the inference speed, so the accuracy loss doesn't matter right now.