
Llama-3_1-Nemotron 51B support #726

Open

ymcki wants to merge 3 commits into master
Conversation


@ymcki ymcki commented Jan 28, 2025

Dear all,

I am the person who added Llama-3_1-Nemotron-51B support to llama.cpp.

ggml-org/llama.cpp#10669

I tried to add this support to exllamav2 and came up with a hack that
can convert and run inference on Llama-3_1-Nemotron-51B.

While the hack works, I am not sure it is the best way to implement this,
as it changes quite a lot of code in exllamav2.

This is because the current exllamav2 codebase is not designed for the case
where different layers of an LLM can have different numbers of key_value_heads and
different structures, as is the case for DeciLMForCausalLM (this 51B model) and Apple's
OpenELMForCausalLM.

For this 51B model, there are three types of layers (see the sketch after this list):

  1. A normal layer that is the same as in the Llama 3 model it is based on.
  2. A linear attention layer that has no q_proj, k_proj, v_proj or o_proj, but instead
    has a single linear_attn weight that is simply matrix-multiplied with the input.
  3. An attention-free layer that is simply an MLP layer.
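
To make type 2 concrete, here is a minimal, self-contained sketch of what such a layer computes. This is not exllamav2 code; the tensor name linear_attn matches the checkpoint, everything else is illustrative.

```python
import torch
import torch.nn as nn

class LinearAttentionSketch(nn.Module):
    """Illustrative stand-in for the type-2 'linear attention' layer:
    no q/k/v/o projections, no RoPE, no KV cache -- just one weight
    matrix applied to the hidden states."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # mirrors the single `linear_attn` tensor in the checkpoint
        self.linear_attn = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # plain matmul with the input; this layer contributes nothing
        # to the KV cache
        return self.linear_attn(hidden_states)
```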

Also, for this model, the number of kv_heads and the intermediate_size can differ
from layer to layer.
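
In other words, these values have to be read per layer from the model config rather than once for the whole model. A rough sketch of the kind of per-layer description this requires (the field names here are placeholders, not the exact DeciLM config keys):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LayerSpec:
    # hypothetical per-layer record; exllamav2 currently assumes these
    # values are constant across layers, which is what breaks for DeciLM
    kind: str                     # "attention", "linear_attn" or "mlp_only"
    num_kv_heads: Optional[int]   # None for layers without regular attention
    intermediate_size: int

def build_layer_specs(block_configs: list[dict]) -> list[LayerSpec]:
    # one spec per layer instead of reading config.num_key_value_heads
    # and config.intermediate_size once for the whole model
    specs = []
    for block in block_configs:
        if block.get("mlp_only"):
            specs.append(LayerSpec("mlp_only", None, block["intermediate_size"]))
        elif block.get("linear_attn"):
            specs.append(LayerSpec("linear_attn", None, block["intermediate_size"]))
        else:
            specs.append(LayerSpec("attention",
                                   block["num_kv_heads"],
                                   block["intermediate_size"]))
    return specs
```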

As a result, my fork changes quite a lot of code. I also added a file called
linear_attn.py that defines ExLlamaV2LinearAttention to handle the
linear attention layer.

While it has run without errors in my testing so far, I am not sure it
covers all situations. Maybe it would be better to wait for a rewrite that accommodates
models with per-layer variation like DeciLMForCausalLM and OpenELMForCausalLM.

It would be great if this hack could serve as a starting point for such a rewrite,
so that I can add the support later as a cleaner contribution.

Thank you very much for your time.


ymcki commented Jan 29, 2025

Removed linear_attn.py by merging the ExLlamaV2LinearAttention class into ExLlamaV2Attention.
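
For reference, a rough sketch of what the merged approach looks like conceptually. This is not the real ExLlamaV2Attention API, just the control flow: the class holds either the usual q/k/v/o projections or the single linear_attn weight, and forward() branches accordingly.

```python
import torch
import torch.nn as nn

class MergedAttentionSketch(nn.Module):
    """Conceptual sketch only: one attention class covering both the
    regular attention layers and the linear attention layers."""

    def __init__(self, hidden_size: int, is_linear_attn: bool):
        super().__init__()
        self.is_linear_attn = is_linear_attn
        if is_linear_attn:
            # type-2 layer: a single weight, no KV cache entry
            self.linear_attn = nn.Linear(hidden_size, hidden_size, bias=False)
        else:
            # regular layer: q/k/v/o projections (RoPE, cache, etc. omitted)
            self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
            self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
            self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
            self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.is_linear_attn:
            return self.linear_attn(hidden_states)
        # placeholder for the usual scaled-dot-product attention path
        raise NotImplementedError("regular attention path omitted in this sketch")
```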
