Support for Llama-3_1-Nemotron-51B #10669
Conversation
I wonder if it's also a better idea not to group this with the normal llama archs; since it requires so many changes, it may be better to make it its own model type?
I think src/llama.cpp doesn't change that much, but convert_hf_to_gguf.py does have more changes. Anyway, I can make another fork to make it a separate model type and submit another pull request. What do you think about the vocab.py problem? Should I just leave the original vocab.py as is and ask people to fix tokenizer_config.json instead?
Created a separate Deci model. This version doesn't change vocab.py and relies on people manually fixing the 51B model's tokenizer_config.json.
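(For context: making it its own model type on the C++ side usually comes down to adding a new entry to the architecture enum and its name table in src/llama.cpp. The sketch below is illustrative only; LLM_ARCH_DECI and "deci" are assumed names for the example, not necessarily what this PR uses.)

```cpp
// Illustrative sketch, not taken from this PR: a separate model type is
// typically registered by extending the architecture enum and its name table
// in src/llama.cpp. LLM_ARCH_DECI and "deci" are assumed names for the example.
#include <map>

enum llm_arch {
    LLM_ARCH_LLAMA,
    LLM_ARCH_DECI,   // hypothetical separate arch for Llama-3_1-Nemotron-51B
    LLM_ARCH_UNKNOWN,
};

static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA, "llama" },
    { LLM_ARCH_DECI,  "deci"  },
};
```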
Any updates?
src/llama.cpp
Outdated
if (n_head == 0) // attention-free layer of Llama-3_1-Nemotron-51B
    cur = inpL;
else {
Suggested change:
if (n_head == 0) { // attention-free layer of Llama-3_1-Nemotron-51B
    cur = inpL;
} else {
src/llama.cpp
Outdated
} else if (n_head > 0)
// self-attention
{
Suggested change:
} else if (n_head > 0) {
    // self-attention
To fix the editorconfig / flake8 checks, you need to modify your source code to remove trailing spaces and add the missing newline at the end of the file. And to fix the server CI, you need to merge the latest commits from the master branch.
Can someone approve the workflows? |
Yay! Finally passed all checks! :) |
src/llama.cpp
Outdated
    cur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wo, cur);
    cb(cur, "wo", il);
} else if (n_head > 0) {
    // self-attention
Suggested change:
// self-attention
src/llama.cpp
Outdated
const int64_t n_head_kv = hparams.n_head_kv(il);
const int64_t n_head    = hparams.n_head(il);

if (n_head == 0) { // attention-free layer of Llama-3_1-Nemotron-51B
Suggested change:
if (n_head == 0) {
    // attention-free layer of Llama-3_1-Nemotron-51B
src/llama.cpp
Outdated
cb(cur, "attn_norm", il); | ||
} | ||
|
||
if (n_head > 0 && n_head_kv == 0) { // "linear attention" of Llama-3_1-Nemotron-51B |
Suggested change:
if (n_head > 0 && n_head_kv == 0) {
    // "linear attention" of Llama-3_1-Nemotron-51B
More details are here:
#10648
It seems like my changes in vocab.py don't really break the CI tests.
It might be a better idea not to modify vocab.py but instead ask the user to fix tokenizer_config.json. In that case, you can ignore the changes I made in vocab.py.