Dear all,
I am the person who added Llama-3_1-Nemotron-51B support to llama.cpp.
ggml-org/llama.cpp#10669
I tried to add this support to exllamav2 and came up with a hack that
can convert and run inference on Llama-3_1-Nemotron-51B.
While the hack works, I am not sure it is the best way to implement this,
as it changes quite a lot of code in exllamav2.
This is because the current exllamav2 codebase is not designed for the case
where different layers of an LLM can have a different number of key_value_heads and
different structures, as with DeciLMForCausalLM (this 51B model) and Apple's
OpenELMForCausalLM.
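To make the per-layer variation concrete, here is a minimal sketch of how a loader might plan heterogeneous layers. The config keys and the `build_layer_plan` helper are hypothetical, for illustration only; they are not the actual exllamav2 or Hugging Face config names.

```python
# Hypothetical sketch of per-layer heterogeneous configs, as in
# DeciLMForCausalLM / OpenELMForCausalLM. Key names are illustrative.

def build_layer_plan(block_configs):
    """For each layer, decide which attention variant to build and
    how many kv heads it uses (0 for a linear_attn layer, which has
    no Q/K/V projections and no KV cache)."""
    plan = []
    for i, cfg in enumerate(block_configs):
        if cfg.get("linear_attn"):
            plan.append((i, "linear_attn", 0))
        else:
            plan.append((i, "attention", cfg["num_key_value_heads"]))
    return plan

blocks = [
    {"num_key_value_heads": 8},   # regular attention, 8 kv heads
    {"linear_attn": True},        # attention replaced by a linear layer
    {"num_key_value_heads": 2},   # regular attention, fewer kv heads
]
print(build_layer_plan(blocks))
# → [(0, 'attention', 8), (1, 'linear_attn', 0), (2, 'attention', 2)]
```

A codebase that assumes one global `num_key_value_heads` cannot represent a plan like this, which is why the changes ended up touching so many files.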
For this 51B model, there are three types of layers; one of them is
a linear_attn, in which the input is simply multiplied by a weight matrix.
Also, for this model, the number of kv_heads and the intermediate_size can differ
from layer to layer.
As a result, there are quite a lot of changes to the code in my fork. I also added
a file called linear_attn.py, which defines ExLlamaV2LinearAttention to handle the
linear attention layers.
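Stripped of exllamav2's module machinery, the core operation a linear attention layer performs can be sketched like this. The class below is illustrative only (plain Python, no torch) and is not the actual ExLlamaV2LinearAttention implementation; it just shows that the layer reduces to a single matmul with no Q/K/V projections and no KV cache.

```python
# Illustrative-only sketch of a linear_attn layer: the hidden state is
# multiplied by one weight matrix, replacing the whole attention block.

class LinearAttentionSketch:
    def __init__(self, weight):
        self.weight = weight              # weight: [dim_in][dim_out]

    def forward(self, hidden):            # hidden: [seq][dim_in]
        out = []
        for row in hidden:
            # one output value per column of the weight matrix
            out.append([
                sum(x * w for x, w in zip(row, col))
                for col in zip(*self.weight)
            ])
        return out

attn = LinearAttentionSketch([[1.0, 0.0], [0.0, 2.0]])
print(attn.forward([[3.0, 4.0]]))
# → [[3.0, 8.0]]
```

Because there is no KV cache involved, such a layer also sidesteps the per-layer kv_heads bookkeeping entirely, which is part of why it needed its own module.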
While it runs without errors in my testing so far, I am not sure it
covers all situations. It may be better to wait for a rewrite that accommodates
models with per-layer variation like DeciLMForCausalLM and OpenELMForCausalLM.
It would be great if this hack could serve as a starting point for such a rewrite and
allow me to add the support later as a cleaner contribution.
Thank you very much for your time.