
Bamba architecture #10810

Draft · wants to merge 43 commits into base: master

Conversation

gabe-l-hart (Contributor)

Description

This PR adds support for the forthcoming Bamba architecture from IBM Research. It is a hybrid SSM architecture similar in nature to Jamba, but it uses the mamba2 mixer instead of the original mamba mixer.

Dependencies

This PR is based on in-flight work for this new model, which will be published soon. There are other in-flight PRs for this model.

NOTE: In order to run the conversion steps, you will need to install the branch of transformers from the above PR until it is merged and released.

TODOs

  • Figure out trajectory for the mamba2 PR
  • Fix support on Metal
  • Ensure support with CUDA

Changes

This PR comes with some big changes in addition to the standard pattern for contributing a new architecture. These big changes are needed because the KV cache is used differently for attention layers than for recurrent layers.

Library Interface

  • Add llama_model_is_hybrid. This mirrors the llama_model_is_recurrent function and currently only returns true for Bamba (a minimal sketch follows below).
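
A minimal sketch of what this new entry point could look like, assuming it mirrors the signature of the existing llama_model_is_recurrent; the exact placement and switch contents in the PR may differ:

    // Sketch only (assumed shape): mirrors llama_model_is_recurrent and
    // returns true only for the Bamba architecture.
    bool llama_model_is_hybrid(const struct llama_model * model) {
        switch (model->arch) {
            case LLM_ARCH_BAMBA: return true;
            default:             return false;
        }
    }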

llama.cpp project architecture

  • llama_hparams:
    • Add recurrent_layer_arr to support per-layer indicator for recurrence
    • Add bool recurrent_layer(uint32_t il) const to index into recurrent_layer_arr
    • Update n_embd_k_s and n_embd_v_s to be per-layer and return 0 for non-recurrent layers (illustrated in the sketch after this list)
  • llama_context:
    • Add struct llama_kv_cache kv_hybrid as a secondary KV cache for hybrid models. This is the biggest structural change that allows the recurrent and attention layers to operate independently in the cache. For non-hybrid models, this will never be initialized.
  • llama_kv_cache_init:
    • Add the recurrent argument. This allows recurrent models to initialize a non-recurrent cache (i.e. the attention cache for hybrid models)
    • Determine the size of the cache tensors on a per-layer basis so that layers which are managed by the other cache in a hybrid model have zero size (i.e. in the kv_self, all recurrent layers will have zero size, and in kv_hybrid, all attention layers will have zero size)
  • llm_load_hparams:
    • Automatically fill recurrent_layer_arr based on llama_model_is_recurrent(&model). For hybrid models, it falls to the individual model architecture construction to populate the per-layer entries correctly.
  • llm_build_mamba2:
    • Allow kv_hybrid as the cache that gets initialized instead of kv_self. This is determined based on a new hybrid flag passed by the build_<arch> implementation.
  • llm_build_context:
    • Add replicas of all KV-related members to support the second kv_hybrid cache (kv_hybrid, n_kv_hybrid, kv_head_hybrid, rs_zero_hybrid)
    • Populate all of ^ based on lctx.kv_hybrid
  • llama_set_inputs:
    • Do the recurrent branch for hybrid models and use kv_self or kv_hybrid as appropriate (kv_hybrid will be the recurrent cache for hybrid models)
  • llama_decode_internal:
    • Use simple_split for hybrid models as well, even if kv_self.recurrent is false
    • Add a second llama_kv_slot_restorer for kv_hybrid and perform the save/restore operations if (and only if) the model is hybrid
    • Update kv_hybrid ring buffer when needed
    • Defrag the kv_hybrid cache for hybrid models
  • llama_new_context_with_model:
    • Manage kv_size and kv_size_hybrid independently
    • Init kv_self as recurrent IFF the model is recurrent but not hybrid
    • Init the kv_hybrid cache for hybrid models
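
To make the per-layer split described above concrete, here is a small, self-contained toy sketch. It is not the PR's actual code: the row sizes, the row_size helper, and the example layer layout are invented for illustration, and only recurrent_layer_arr and recurrent_layer follow the names above. It shows how per-layer bookkeeping lets each of the two caches allocate zero-sized entries for the layers owned by the other cache:

    // Toy illustration of the per-layer split between the two caches.
    // Member names follow the PR description (recurrent_layer_arr,
    // recurrent_layer); sizes and the example layout are made up.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct hparams_sketch {
        std::vector<bool> recurrent_layer_arr; // true -> recurrent (SSM) layer, false -> attention layer
        uint32_t n_embd_attn = 4096;           // assumed attention row size per layer
        uint32_t n_embd_ssm  = 1024;           // assumed recurrent-state row size per layer

        bool recurrent_layer(uint32_t il) const { return recurrent_layer_arr[il]; }

        // row size that a cache initialized with `recurrent` must hold for layer il;
        // layers owned by the other cache get zero size
        uint32_t row_size(uint32_t il, bool recurrent) const {
            if (recurrent_layer(il) != recurrent) {
                return 0;
            }
            return recurrent ? n_embd_ssm : n_embd_attn;
        }
    };

    int main() {
        // hypothetical 6-layer hybrid model with attention at layers 2 and 5
        hparams_sketch hp;
        hp.recurrent_layer_arr = {true, true, false, true, true, false};

        for (uint32_t il = 0; il < hp.recurrent_layer_arr.size(); ++il) {
            // for a hybrid model, kv_self holds the attention layers and
            // kv_hybrid holds the recurrent layers
            std::printf("layer %u: kv_self=%u kv_hybrid=%u\n",
                        il,
                        hp.row_size(il, /*recurrent=*/false),
                        hp.row_size(il, /*recurrent=*/true));
        }
        return 0;
    }

For a non-hybrid model only kv_self is used, so the second cache never gets initialized; for a hybrid model, the zero-sized entries are what let both caches loop over all layers without allocating storage twice.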

Model Architecture

  • Add architecture enum and layer set as normal
  • Add hparam for LLM_KV_ATTENTION_LAYER_INDICES ("%s.attention.layer_indices") to indicate per-layer list of which layers use attention
    • NOTE: Some hybrid models use a period and offset rather than an explicit list, but the explicit list is the most flexible, so I opted to use only the list at the llama.cpp layer and require converting from period/offset to a list in the conversion script (see the toy sketch after this list)
  • Add hparam LLM_KV_SSM_HEAD_DIM ("%s.ssm.head_dim") to set the head_dim from config rather than deducing it from d_inner / n_head
  • Add model enum entry in llm_load_tensors
  • Add build_bamba to construct the graph
  • Mark LLM_ARCH_BAMBA as hybrid in llama_model_is_hybrid
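
As a concrete illustration of the period/offset-to-list mapping mentioned in the note above, here is a toy sketch. The real conversion happens in the Python conversion script; the period/offset parameter names and the example schedule are assumptions, not values from any actual config:

    // Toy sketch of mapping a (period, offset) attention schedule to the
    // explicit per-layer index list stored in %s.attention.layer_indices.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    static std::vector<uint32_t> attn_layer_indices(uint32_t n_layer, uint32_t period, uint32_t offset) {
        std::vector<uint32_t> indices;
        for (uint32_t il = 0; il < n_layer; ++il) {
            if (il % period == offset) {
                indices.push_back(il);
            }
        }
        return indices;
    }

    int main() {
        // e.g. 12 layers with an attention layer every 4th layer, starting at layer 3
        for (uint32_t il : attn_layer_indices(12, 4, 3)) {
            std::printf("%u ", il); // prints: 3 7 11
        }
        std::printf("\n");
        return 0;
    }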

Conversion

  • Add Keys.HybridMamba section in constants.py to hold hybrid model parameters
    • NOTE: I'm torn on whether this should be Keys.HybridMamba, Keys.HybridRecurrent, or just Keys.Hybrid
  • Add new hparams with plumbing for ssm_head_dim and attn_layer_indices in constants.py and gguf_writer.py
  • Update tensor_mapping.py names for Bamba layer names
  • Add class BambaModel in convert_hf_to_gguf.py and base it on Mamba2Model

Open Questions

  • What are the plans for the current mamba2 PR?
  • Is there a better approach to handle the hybrid KV caching? I tried going the route of adding additional pointers into the single kv_self, but kept getting hung up on kv_self.size being different for the two types of caching.
  • Are there other places where the KV cache is used that should be updated to also support kv_hybrid?

compilade and others added 30 commits August 21, 2024 18:00

Commit message excerpts:

  • ggml : improve ggml_mul speed when masking recurrent states
  • ggml : make the ggml_mul fast broadcast path more consistently formatted
  • The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
  • The max index is 31, so trimming the arguments is necessary.
  • Whoops, this is needed for the offset in the concatenated output.
  • This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
  • This makes the weight buft detection in src/llama.cpp simpler.
  • convert : transpose Mamba-2 A, D and reshape SSM_NORM. This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
  • llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
  • There are likely still some missing hparams, but the tensor mapping should be correct.
  • ….cpp parsing
  • … step rank: head count and time step rank are used for the same purpose in the model, so we stick with the existing key. Chunk size is not used in this impl because of the way the mixer is implemented without chunking.
  • Not necessary despite their presence in the model config.
  • There are still problems at inference around matrix dimensions not lining up, so there are likely still places where the per-layer sizes are not being used correctly.
  • For hybrid models, this value should be 0 for the non-recurrent layers.
  • This also seems like not _quite_ the right direction.
  • …rid models
  • Still not fully working, but worth committing these: per-layer n_embd_[kv]_s (probably a no-op since first layer is ssm); fix setting n_kv_hybrid when not worst_case; use the right n_kv for build_inp_s_copy when hybrid; use the right n_kv for recurrent section of llama_set_inputs; use the right logic to determine batch splitting for hybrid.
  • I'd added this at one point, but it's not actually needed.

Most of the above commits are on branch BambaArchitecture and are signed off by Gabe Goodhart <[email protected]>.
ggerganov (Owner) commented Dec 13, 2024

What are the plans for the current mamba2 PR?

I think src/llama.cpp desperately needs some refactoring before we continue to introduce major changes to it. It's time to split it into multiple files and refactor the KV cache implementation to support different modes and be able to add tests. I have started several times to do that, but keep getting side-tracked by some other things.

After finalizing the TTS arch, I will try to finally do this refactoring. Or at the very least - the "split into separate source files" part.

I think mamba2 support should be added after this happens.

Is there a better approach to handle the hybrid KV caching? I tried going the route of adding additional pointers into the single kv_self, but kept getting hung up on kv_self.size being different for the two types of caching.

Are there other places where the KV cache is used that should be updated to also support kv_hybrid?

It's hard to answer. We are hacking the new KV modes into the original design and things are sometimes hard to fit. That is why we need to reimplement this in order to allow different implementations for different use cases.

gabe-l-hart (Contributor, Author)

@ggerganov Thanks for the feedback!

It's time to split it into multiple files and refactor the KV cache implementation to support different modes and be able to add tests.

This is music to my ears! I honestly thought about trying to do this myself to help me decompose the problem of adding this hybrid support (knowing it would be throw-away, but a very good learning experience). I know I'm just dipping my toes in at this point, but if there's any help I can offer here, please let me know.

For the time being, unless you'd prefer otherwise, I'll keep this PR open in Draft as a guide to interested parties trying out the Bamba architecture. Once the big refactor moves forward, I'll look into refactoring this work on top of it.
