Describe the issue

Hi,

I followed the instructions for running MInference on HF and ran an example where I give the model the full text of Dante's Inferno (in Italian) as well as a book from the Harry Potter series, and then ask it a few questions. I'm testing this on the Llama 3.1 8B Instruct model, but with the config modified so the sequence length is 262k. The full script is here, and it follows roughly the flow sketched below:

https://gist.github.com/YLGH/2b70d6ed10a6b5ea97404cb2668e24f3
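For reference, a minimal sketch of the patching flow being described, assuming the `MInference` HF patch API shown in the repo README; the exact prompt construction and the 262144 value for the context override are placeholders standing in for what the gist above actually does:

```python
# Minimal sketch of the MInference HF patching flow (not the exact gist script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # Config override so the model accepts ~262k-token prompts (assumed value).
    max_position_embeddings=262144,
)

# Patch the model's attention with MInference sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Long prompt = Inferno (Italian) + a Harry Potter book + a few questions.
prompt = "..."  # placeholder; the real prompt is built in the gist
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```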
The output when using MInference attention seems to be completely off. It doesn't acknowledge the existence of Dante's Inferno at all, and says that I gave it books 1 and 2 of Harry Potter.
I also ran the same prompt through full dense attention, and it's able to distinguish the two.
Am I using the HF example correctly?
Thanks!
Also curious whether this is related to the observation that some attention heads are fully dense (as in DuoAttention)? Perhaps this is something that the benchmarks don't measure well?
Hi @YLGH, sorry, I haven't had a chance to check the previous issues yet, but I can provide a quick answer to your question.
For methods like DuoAttention and RazorAttention, I think they're quite reasonable. First, a head-level hybrid sparsity approach makes a lot of sense: some heads can handle their tasks with only an A-shape pattern (initial "sink" tokens plus a local window). A similar approach is also used in pretrained LLMs, such as those from Character.AI and Yi-Lightning, which shows its effectiveness.
However, from my perspective, this approach is not fully optimized. The main reason is that attention heads are inherently very sparse, regardless of the specific head.
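To make the A-shape pattern concrete, here is a toy sketch of such a mask versus full causal attention. This is only a conceptual illustration, not MInference's or DuoAttention's actual implementation, and the sink/window sizes are arbitrary placeholders:

```python
# Toy illustration of an "A-shape" sparse attention mask:
# always attend to a few initial "sink" tokens plus a recent local window.
import torch

def a_shape_mask(seq_len: int, n_sink: int = 4, window: int = 64) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask: True where attention is allowed."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = k <= q                          # standard causal constraint
    sink = k < n_sink                        # always attend to the first tokens
    local = (q - k) < window                 # and to a recent local window
    return causal & (sink | local)

mask = a_shape_mask(seq_len=1024)
dense_causal = torch.tril(torch.ones(1024, 1024, dtype=torch.bool))
ratio = (mask.sum() / dense_causal.sum()).item()
print(f"A-shape keeps {ratio:.1%} of the causal attention entries")
```

In DuoAttention's terms these would be the "streaming" heads; the point above is that even the remaining "retrieval" heads are still highly sparse, just with dynamic rather than fixed patterns.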