
[Question]: Am I using minference correctly? #83

Open
YLGH opened this issue Oct 30, 2024 · 2 comments
YLGH commented Oct 30, 2024

Describe the issue

Hi,

I followed the instructions for running MInference with Hugging Face and ran an example where I give the model the full text of Dante's Inferno (in Italian) as well as a book from the Harry Potter series, and then ask it a few questions. I'm testing this on the Llama 3.1 8B Instruct model, but with the config modified so the max sequence length is 262k.

https://gist.github.com/YLGH/2b70d6ed10a6b5ea97404cb2668e24f3
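
Roughly, my setup looks like the sketch below (a simplified version of the gist, not the exact code; the MInference patching call follows the pattern in the project README, and the model id and dtype here are just illustrative, so exact names may differ between versions):

```python
# Sketch of the setup described above: load Llama 3.1 8B Instruct with an
# enlarged context window, then patch it with MInference. The patching call
# follows the pattern in the MInference README; exact names may differ
# between versions.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model id

# Raise the maximum sequence length to ~262k, as described above.
config = AutoConfig.from_pretrained(model_name)
config.max_position_embeddings = 262144

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Apply the MInference sparse-attention patch to the loaded model.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Generation then proceeds as usual with tokenizer(...) and model.generate(...).
```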

The output when using MInference attention seems to be completely off: it doesn't acknowledge the existence of Dante's Inferno at all, and says that I gave it books 1 and 2 of Harry Potter.

I also ran the same prompt through full dense attention, and it's able to distinguish the two.

Am I using the HF example correctly?

Thanks!

YLGH added the question (Further information is requested) label on Oct 30, 2024
iofu728 self-assigned this on Nov 4, 2024

YLGH commented Nov 11, 2024

Hi, bump on this.

I'm also curious whether it's related to observations that some attention heads are fully dense (as in DuoAttention)? Perhaps that is something the benchmarks don't measure well.


iofu728 commented Nov 12, 2024

Hi @YLGH, sorry, I haven't had a chance to check the previous issues yet, but I can provide a quick answer to your question.

Methods like DuoAttention and RazorAttention seem quite reasonable to me. First, a head-level hybrid sparsity approach makes a lot of sense: some heads should be able to handle tasks with only an A-shape pattern. A similar approach is also used in pretrained LLMs, such as those from Character.AI and Yi-Lightning, which shows its effectiveness.

However, from my perspective, this approach is not fully optimized. The main reason is that attention heads are inherently very sparse, regardless of the specific head.
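
For readers unfamiliar with the A-shape pattern mentioned above, here is a minimal sketch (my own illustration, not MInference code; the parameter values are arbitrary) of the mask such a head effectively uses: a few initial "sink" tokens plus a local window, under the usual causal constraint.

```python
# Minimal illustration of an "A-shape" sparse-attention mask: each query
# attends only to the first few tokens and to a recent local window.
import numpy as np

def a_shape_mask(seq_len: int, n_sink: int = 4, window: int = 512) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask; True = key position is attended to."""
    q = np.arange(seq_len)[:, None]  # query positions
    k = np.arange(seq_len)[None, :]  # key positions
    causal = k <= q                  # no attention to future tokens
    sink = k < n_sink                # always keep the first few tokens
    local = (q - k) < window         # keep a window of recent tokens
    return causal & (sink | local)

print(a_shape_mask(seq_len=16, n_sink=2, window=4).astype(int))
```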
