
llama : add support for Cohere2ForCausalLM #10900

Open · wants to merge 1 commit into master
Conversation

dranger003 (Contributor) commented on Dec 19, 2024

Closes #10816

Cohere updated their Command-R model architecture for C4AI Command R7B, which requires an update to llama.cpp. Looking at the HF code, the model appears to use a hybrid cache like Gemma2. Additional info from their model page on HF:

The model features three layers with sliding window attention (window size 4096) and ROPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.

Summary of changes in this PR (based on my very limited knowledge of neural nets):

  • Add sliding window and RoPE dim count during conversion
  • Remove ATTN_K_NORM and ATTN_Q_NORM
  • Support alternating sliding window attention in build_cohere2 (modeled on llama.cpp's build_gemma2), using a pattern of 4 layers (see the sketch after this list)
  • Use LLAMA_ROPE_TYPE_NORM as the rope type
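
For context, here is a minimal, self-contained C++ sketch of the layer interleaving implied by "sliding_window_pattern": 4 with "local_attn_first": three consecutive SWA layers with RoPE, followed by one global-attention layer without positional embeddings. The helper layer_uses_swa and the standalone program are illustrative only and are not the actual build_cohere2 code.

#include <cstdio>

// Illustrative only: layers 0..2 of every block of 4 are local (SWA + RoPE),
// and every 4th layer uses global attention without positional embeddings.
static bool layer_uses_swa(int il, int sliding_window_pattern = 4) {
    return (il + 1) % sliding_window_pattern != 0;
}

int main() {
    const int n_layer = 32; // num_hidden_layers from config.json
    for (int il = 0; il < n_layer; ++il) {
        std::printf("layer %2d: %s\n", il,
                    layer_uses_swa(il) ? "SWA (window 4096) + RoPE"
                                       : "global attention, no positional embedding");
    }
    return 0;
}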

HF transformers implementation reference:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere2/modular_cohere2.py

Test weights:
https://huggingface.co/dranger003/c4ai-command-r7b-12-2024-GGUF

github-actions bot added the python (python script changes) label on Dec 19, 2024
dranger003 marked this pull request as draft on December 19, 2024 at 15:12
dranger003 (Contributor, Author) commented on Dec 19, 2024

HF config.json:

{
  "architectures": [
    "Cohere2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 5,
  "cache_implementation": "hybrid",
  "eos_token_id": 255001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "layer_norm_eps": 1e-05,
  "layer_switch": 4,
  "logit_scale": 0.25,
  "max_position_embeddings": 8192,
  "model_type": "cohere2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "order_of_interleaved_layers": "local_attn_first",
  "pad_token_id": 0,
  "position_embedding_type": "rope_gptj",
  "rope_scaling": null,
  "rope_theta": 50000,
  "rotary_pct": 1.0,
  "sliding_window": 4096,
  "sliding_window_pattern": 4,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.0.dev0",
  "use_cache": true,
  "use_embedding_sharing": true,
  "use_gated_activation": true,
  "use_parallel_block": true,
  "use_parallel_embedding": true,
  "vocab_size": 256000
}

dranger003 (Contributor, Author) commented:
Info from @foldl:

It uses (3 SWA layers + 1 global attention layer). So build_command_r needs to be updated, even though the result seems promising.

Here is an implementation of interleaved SWA/global-attention layers.

https://github.com/foldl/chatllm.cpp/blob/ff54a787948f02151b38231375be042b632a271e/models/cohere.cpp#L246C1-L258C1

class Cohere2Model(Model):
    model_arch = gguf.MODEL_ARCH.COHERE2

    def set_gguf_parameters(self):
dranger003 (Contributor, Author):
The config.json has "max_position_embeddings": 8192, but the model supports 128K context. Do we need to adjust this value here?

Contributor:
Don't quote me on this, but I think it's fine to leave this as-is and force users to adjust rope settings to enable the full context.
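
(For illustration only: if the GGUF metadata keeps the 8192 value, the user-facing overrides live in llama_context_params in the llama.cpp C API, sketched below. The model path and the n_ctx value are placeholders, and this makes no claim about which RoPE settings are actually correct for this model.)

#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) return 1; // argv[1]: path to a GGUF file (placeholder)

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == nullptr) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx           = 32768; // request more than the 8192 stored in the metadata
    cparams.rope_freq_base  = 0.0f;  // 0 = use the model's rope_theta (50000 here)
    cparams.rope_freq_scale = 0.0f;  // 0 = use the model's scaling factor

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == nullptr) { llama_free_model(model); return 1; }

    // ... run inference as usual ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}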

src/llama.cpp (outdated):
cb(Vcur, "Vcur", il);
}

Qcur = ggml_rope_ext(ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, nullptr,
dranger003 (Contributor, Author):
Do we need to pass build_rope_factors(il) as the c argument when calling ggml_rope_ext for this model?

Contributor:
RoPE is only applied to SWA layers.

dranger003 (Contributor, Author) commented on Dec 19, 2024:
Got it, looks like the cache is working now. Not sure if I still need build_rope_factors() though?
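
(Editorial sketch of what "RoPE is only applied to SWA layers" can look like inside the layer loop, following the pattern of the quoted diff above. The surrounding variables are assumed from the existing build context, the rope factors argument stays nullptr since rope_scaling is null in config.json, and this is not the PR's actual code.)

// Sketch only: assumes ctx0, il, Qcur, Kcur, inp_pos, n_embd_head, n_head, n_head_kv,
// n_tokens, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ext_factor,
// attn_factor, beta_fast, beta_slow and cb from the existing build context.
const bool is_swa = (il + 1) % 4 != 0; // 3 local (SWA) layers, then 1 global layer

if (is_swa) {
    // sliding-window layers: apply RoPE; no rope factors needed while rope_scaling is null
    Qcur = ggml_rope_ext(ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens), inp_pos, nullptr,
                         n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);
    Kcur = ggml_rope_ext(ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos, nullptr,
                         n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);
} else {
    // global-attention layer: no positional embedding, only reshape for the attention op
    Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
    Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
}
cb(Qcur, "Qcur", il);
cb(Kcur, "Kcur", il);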

dranger003 marked this pull request as ready for review on December 20, 2024 at 00:26
dranger003 changed the title from "Add support for Cohere2ForCausalLM" to "llama : add support for Cohere2ForCausalLM" on Dec 20, 2024
Labels: python (python script changes)
Projects: none yet
Development: successfully merging this pull request may close these issues: Feature Request: Support for C4AI Command R7B / Cohere2ForCausalLM
3 participants