Incorporation with Llama 2? #8

Open
BDHU opened this issue Sep 27, 2023 · 5 comments

Comments

BDHU commented Sep 27, 2023

Is it possible to use this with Llama 2? I'm interested in improving inference speed, so the accuracy loss doesn't matter right now.

JamesTheZ (Collaborator) commented

I believe there is no problem with using the Flash-LLM kernel on Llama 2. Flash-LLM mainly consists of a high-performance SpMM GPU kernel, which should be efficient for all existing LLM inference MatMul shapes.
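
For illustration, here is a minimal sketch of the sparse-weight × dense-activation (SpMM) pattern that such a kernel accelerates. It uses PyTorch's generic sparse ops purely to show the idea; it is not Flash-LLM's API, and the 4096×4096 shape is only an assumed Llama-2-7B-like projection size:

```python
import torch

hidden = 4096                     # assumed Llama-2-7B hidden size
batch_tokens = 8                  # tokens processed in this decode step

# Dense weight with ~80% of entries zeroed to mimic unstructured sparsity.
w = torch.randn(hidden, hidden)
w = w * (torch.rand_like(w) > 0.8)

x = torch.randn(hidden, batch_tokens)   # dense activations

# Dense baseline: y = W @ x
y_dense = w @ x

# Sparse path: keep W in a sparse format and run SpMM instead of dense GEMM.
# (A tuned kernel would use its own storage format; COO here is only for
# illustration and is not fast.)
w_sparse = w.to_sparse()
y_sparse = torch.sparse.mm(w_sparse, x)

print((y_dense - y_sparse).abs().max())   # should match up to float error
```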

YixinSong-e commented Oct 2, 2023

Does Llama 2 have such high unstructured sparsity? And can our method be combined with quantization?

BDHU (Author) commented Oct 3, 2023

> Does Llama 2 have such high unstructured sparsity? And can our method be combined with quantization?

We do have ongoing research that achieves 70% unstructured sparsity on Llama 2 with negligible accuracy loss; that's why we want to see the speed gain from removing those weights.
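
(That pruning method isn't described in this thread; as a rough stand-in, here is plain per-layer magnitude pruning to 70% unstructured sparsity, just to show mechanically what that sparsity level means for a Llama 2 weight matrix. The 4096×11008 shape is an assumed Llama-2-7B MLP projection, and this is not the ongoing research mentioned above.)

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.7) -> torch.Tensor:
    """Zero out the smallest-magnitude `sparsity` fraction of entries."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(4096, 11008)      # assumed Llama-2-7B MLP projection shape
w_pruned = magnitude_prune(w, 0.7)
print(f"achieved sparsity: {(w_pruned == 0).float().mean():.2%}")
```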

YixinSong-e commented

> We do have ongoing research that achieves 70% unstructured sparsity on Llama 2 with negligible accuracy loss; that's why we want to see the speed gain from removing those weights.

That's amazing! To my knowledge, the SparseGPT and Wanda methods significantly increase perplexity when LLaMA is pruned to 70% sparsity.
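
(For context on the methods mentioned, here is a minimal sketch of a Wanda-style importance score, |weight| × input-activation norm with pruning per output row, written from the published description rather than the authors' code; the shapes and calibration batch below are made-up stand-ins.)

```python
import torch

def wanda_prune(weight: torch.Tensor, calib_x: torch.Tensor, sparsity: float = 0.7):
    """weight: (out_features, in_features); calib_x: (num_tokens, in_features)."""
    act_norm = calib_x.norm(p=2, dim=0)         # L2 norm per input channel
    score = weight.abs() * act_norm             # importance of each weight
    k = int(weight.shape[1] * sparsity)         # weights to drop per output row
    drop_idx = score.argsort(dim=1)[:, :k]      # lowest-scoring entries per row
    mask = torch.ones_like(weight)
    mask.scatter_(1, drop_idx, 0.0)
    return weight * mask

w = torch.randn(4096, 4096)       # made-up projection shape
x = torch.randn(128, 4096)        # made-up calibration activations
w_pruned = wanda_prune(w, x, 0.7)
print(f"sparsity: {(w_pruned == 0).float().mean():.2%}")
```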

Summer-Summer (Collaborator) commented

> We do have ongoing research that achieves 70% unstructured sparsity on Llama 2 with negligible accuracy loss; that's why we want to see the speed gain from removing those weights.

That is amazing! For 70% unstructured sparsity, I believe even better performance can be achieved compared to our existing implementation. I am currently working on another project, but we can help you further optimize the support for 70% unstructured sparsity later.
