🚀 The feature, motivation and pitch
Add support for the text decoder backbone of the new Gemma 3 (1B / 4B for edge) model.

The model architecture should slide right into our llama_transformer.py, with the exception of the interspersed sliding window local attention layers specified in the technical report, which will require some modifications to our model code. Luckily, the sliding window is applied by slicing the attention mask rather than the KV cache, so we can keep using our static KV cache implementation. The local/global attention mechanism uses a ring buffer (a dynamic KV cache) for the local layers, so we will need to enable ring buffer support on ET first. Rough sketches of both pieces are below.

Checkpoints are on HuggingFace:
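To make the mask-slicing idea concrete, here is a minimal sketch in plain PyTorch (not ExecuTorch's actual model code; the function name and shapes are illustrative) of how a local layer can keep a full-length static KV cache and restrict attention purely through the mask:

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where mask[i, j] is True iff query i may attend to key j:
    causal (j <= i) and inside the local window (i - j < window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (S, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, S)
    return (j <= i) & (i - j < window)

# Global layers use the plain causal mask (window == seq_len); local layers
# shrink the window. The K/V tensors keep their full static shape either way,
# so the static KV cache implementation is untouched.
S, WINDOW = 16, 4
local_mask = sliding_window_causal_mask(S, WINDOW)
global_mask = sliding_window_causal_mask(S, S)  # degenerates to plain causal

q = k = v = torch.randn(1, 8, S, 64)  # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=local_mask)
```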
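And a hedged sketch of what a ring-buffer (dynamic) KV cache for the local layers might look like; the class name and layout are hypothetical, not ET's actual ring buffer implementation:

```python
import torch

class RingKVCache:
    """Illustrative ring-buffer KV cache for a local layer: the buffer holds
    only the last `window` positions, and each new token overwrites the
    slot that is `window` tokens old."""

    def __init__(self, n_heads: int, window: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(1, n_heads, window, head_dim)
        self.v = torch.zeros(1, n_heads, window, head_dim)

    def update(self, pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
        """Write the K/V for absolute position `pos` (a single decoded token,
        shape (1, n_heads, 1, head_dim)) into slot pos % window."""
        slot = pos % self.window
        self.k[:, :, slot] = k_new.squeeze(2)
        self.v[:, :, slot] = v_new.squeeze(2)
        # Note: attention over this buffer must remap slot indices back to
        # absolute positions (e.g. for RoPE), which is part of what needs
        # enabling on the ET side.
        return self.k, self.v
```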
Optional: after adding Gemma 3, it should be quick to add Gemma 2 2B as well, which is a popular edge model in the local LLM community.
RFC (Optional)
#8228
cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel