[Model] Jamba support #4115

mzusman · 2024-04-16T12:54:58Z

Add Jamba support to vLLM.
This PR comprises three parts:

1. The Jamba modeling file, which encapsulates the Jamba model weights and logic as well as the Mamba cache management.
2. Passing the request IDs of the sequence groups, together with their sequence IDs, into the modeling file so that it can manage the cache.
3. Passing the finished request IDs into the modeling file as well, so that it can free the cache allocated to finished requests.
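The bookkeeping described in the three parts above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the class `MambaSlotTable`, its methods, and the fixed-size slot pool are all hypothetical names and assumptions chosen for the example. It only shows the idea of keying cache slots by request ID and sequence ID and freeing them once a request is reported as finished.

```python
from typing import Dict, List


class MambaSlotTable:
    """Hypothetical bookkeeping: maps (request_id, seq_id) to a cache slot."""

    def __init__(self, max_slots: int) -> None:
        # Indices into a preallocated state buffer (assumed, not from the PR).
        self.free_slots: List[int] = list(range(max_slots))
        # request_id -> {seq_id -> slot index}
        self.slots: Dict[str, Dict[int, int]] = {}

    def slot_for(self, request_id: str, seq_id: int) -> int:
        """Return the slot of a sequence, allocating one on first sight."""
        seq_slots = self.slots.get(request_id, {})
        if seq_id in seq_slots:
            return seq_slots[seq_id]
        if not self.free_slots:
            raise RuntimeError("no free Mamba cache slots")
        slot = self.free_slots.pop(0)
        self.slots.setdefault(request_id, {})[seq_id] = slot
        return slot

    def release_finished(self, finished_request_ids: List[str]) -> None:
        """Return every slot held by a finished request to the free pool."""
        for request_id in finished_request_ids:
            for slot in self.slots.pop(request_id, {}).values():
                self.free_slots.append(slot)


table = MambaSlotTable(max_slots=2)
first = table.slot_for("req-0", seq_id=0)
second = table.slot_for("req-1", seq_id=1)
table.release_finished(["req-0"])          # req-0 finished, its slot is freed
reused = table.slot_for("req-2", seq_id=2)  # the freed slot can be reused
```

Without the finished-request notification, the table would have no way to know that the slot of `req-0` can be handed out again, which is why the third part of the PR is needed.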

BA-78554: Jurassic 2.5

* worked on jurasic2.5 configuration file, updated jurassic2_5 modeling file to support alternating experts/attn layers
* finished working the forward pass of jurassic3.py
* finished working the forward pass of jurassic3.py
* finished working the forward pass of jurassic3.py
* jurassic_3 modeling file works, uses dummy weights initialized by "dummy" flag. Tokenizer raises issues, for now copying the mixtral tokenizer
* changed default tokenizer vocab values, loading of custom .pt weight files works.
* removed notebook
* merging master to jurassic-2.5 to reset head
* Merge branch 'master' into jurassic-2.5
* align to master

Approved-by: Tomer Asida
Approved-by: Mor Zusman

BA-78760: Jamba

* Add support for n concat and splitting
* change naming
* input_metadata is a dict list now in order to pass "n"
* clean up code from unecessary changes and prints
* Remove kv cache allocation in case of mamba layer
* Add the considerations of mamba layer cache into the num of blocks calculation
* Delete mamba cache after profile
* Remove prints
* Cleaning
* - and not _ for requirements

Approved-by: Tomer Asida

* Remove assertion
* adapting jamba vllm to changes after hf release, working on weight loading in modeling file
* splitting the JambaDecoderLayer to JambaMambaDecoderLayer and JambaAttentionDecoderLayer
* weight loading from hf checkpoint supposedly works, might be a mixup in the MoE between the gated and non-gated weights
* Add mamba from jamba modeling file
* Remove slow forward
* Modifications to mamba_mixer
* Save changes, WIP
* Fix cache placement
* Debugging
* Additions and logging
* Jamba with mamba cache handling
* Clean up
* Another cleanup
* Use vllm's RMSNorm instead of JambaRMSNorm, Thier implementation is with fused kernel
* Clean up and orginization of the objects to handle the mamba cache
* Shorten the code for kv cache mem
* Move cache handling inside the Mixer
* Add mamba to the wheel requirements
* Add mamba to the requirements script
* Add mamba_metadata
* Add to __init__ __all__
* Revert 2 commits ad1a3db 'Add mamba to the requirements script' 75ed2c8 'Add mamba to the wheel requirements'
* Clean up
* Naming
* Apply whitespace suggestions from code review
* pass tie_word_embeddings to PretrainedConfig init
* Replace repeat with expand as expand doesn't require more mem
* Allocate really small cache if needed , don't use meta
* Fix for expanded

---------

Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: Erez Schwartz <[email protected]>
Co-authored-by: tomeras91 <[email protected]>

* Drop indecies when finish
* min 1 attention layer
* CG is working on forward pass passing
* Remove comments
* cosmetics - rename indecies -> indices, organize some whitespaces
* Add some TODOs
* Adding mamba cache for cg
* Remove useless vars from input_metadata
* Remove unused import
* Set the seqlen offset to boolean
* Return only hidden state
* Return only hidden states
* Add padding to match forward pass bs
* Is prompt instead of seqlen offset
* Remove mamba cache class (not used)
* Another remove
* Remove
* Use mamba4gc
* Fix mamba forward, run update only on non prompt
* Use 1 index after the maximal index
* Remove import
* Remove import
* typo
* typo
* place holder
* Padding and empty token takes it from the first empty place
* reformat
* Apply suggestions from code review Whitespaces

---------

Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: Tomer Asida <[email protected]>
Co-authored-by: tomeras91 <[email protected]>

Co-authored-by: Mor Zusman <[email protected]>

* Return support for other models apart from jamba
* Support n>1
* A little cleanup
* Rename
* Apply whitespace suggestions from code review
* Add max batch size to the main func
* Fixed attention kv cache bug
* log where requests id are deleted from the dict to debug mode
* Fix typo
* Align with v0.3.3 vllm code
* Remove comments
* Take out model config from CUDAGraph object
* Fix
* Fix typo
* Make the kv cache selection cleaner
* Another typo
* Took the num layers calc outside
* Remove the -1
* Set as num layer / period

---------

Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: tomeras91 <[email protected]>

* Return support for other models apart from jamba
* Support n>1
* Revert 2 commits d054737 'Support n>1' b5167cc 'Return support for other models apart from jamba'
* TP on input and output
* Basic TP impl , working, correctness not working
* TP is working
* Roll back the verification that everything in the weights fits into the model
* Cleanup
* Use world size func
* clean up
* Import
* Apply whitespace suggestions from code review
* Organize imports
* Add comment on the unsqueeze in conv1d
* Organize and remove redundant code in forward pass
* Remove print
* Add comments Co-authored-by: tomeras91 <[email protected]>
* White spaces
* Set as A
* better comment

---------

Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: tomeras91 <[email protected]>

robertgshaw2-neuralmagic · 2024-04-17T01:00:26Z

Cool!


mzusman · 2024-06-30T13:32:39Z

AFAIU, the CI distributed-tests-2-gpus test fails regardless of this PR.

cadedaniel

This PR looks good to me. I will push to get it merged (we also want to use the new finished_request_ids bookkeeping in #5765).

One edge-case that is missed is in the scheduler, where a certain lineage of a sequence can cause it to be finished without ever notifying the Worker. I would rather get this merged and fix that in another PR given it's so esoteric.

I left some nits; feel free to fix impactful ones now or in a follow up PR. Want to get this merged to avoid more rebase work.

I will get another reviewer for the requirements-mamba and Dockerfile. Other parts LGTM.
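As a rough illustration of the finished_request_ids bookkeeping mentioned in this review, the sketch below accumulates finished request IDs and hands them out exactly once per step. The class name `FinishedRequestTracker` and the method `mark_finished` are hypothetical; only the "get and reset" naming echoes a commit message in this PR, and the actual scheduler code may differ.

```python
from typing import List


class FinishedRequestTracker:
    """Hypothetical holder for request IDs that finished since the last step."""

    def __init__(self) -> None:
        self._finished_request_ids: List[str] = []

    def mark_finished(self, request_id: str) -> None:
        # Called whenever a request finishes or is aborted.
        self._finished_request_ids.append(request_id)

    def get_and_reset(self) -> List[str]:
        """Return the pending IDs and clear them so each is reported once."""
        finished = self._finished_request_ids
        self._finished_request_ids = []
        return finished


tracker = FinishedRequestTracker()
tracker.mark_finished("req-0")
tracker.mark_finished("req-3")
first_step = tracker.get_and_reset()   # IDs to send to the worker this step
second_step = tracker.get_and_reset()  # nothing new has finished since
```

The edge case raised above corresponds to a sequence finishing on a path that never calls the equivalent of `mark_finished`, so the worker is never told to free its state.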


WoosukKwon · 2024-07-01T05:46:39Z

QQ: Does this PR support parallel sampling (i.e., n > 1 in sampling params)? While I don't think it is necessary to support parallel sampling in this PR, I'd like to know if this case was considered. Supporting parallel sampling might be a bit hard since it requires implementing copy_blocks for the Mamba cache. If this is not trivial, please leave a comment in the code.

mzusman · 2024-07-01T08:35:41Z

> QQ: Does this PR support parallel sampling (i.e., n > 1 in sampling params)? While I don't think it is necessary to support parallel sampling in this PR, I'd like to know if this case was considered. Supporting parallel sampling might be a bit hard since it requires implementing copy_blocks for the Mamba cache. If this is not trivial, please leave a comment in the code.

This PR does support parallel sampling. We transfer the seq_ids along with the request_ids into the Jamba inner state mapping and copy the blocks accordingly; see the reference in the Jamba code.
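One way to picture the copy step for parallel sampling is the sketch below. It is an assumption-laden illustration, not the Jamba code from this PR: the function `state_for`, the `StateMap` alias, and the use of plain Python lists in place of tensors are all hypothetical. The idea shown is that a new seq_id under an already-known request ID starts from a copy of a sibling's state rather than from zeros.

```python
import copy
from typing import Dict, List

# request_id -> {seq_id -> per-sequence recurrent state}
# (lists of floats stand in for the real state tensors)
StateMap = Dict[str, Dict[int, List[float]]]


def state_for(states: StateMap, request_id: str, seq_id: int,
              state_size: int) -> List[float]:
    """Return the state of a sequence, forking from a sibling if one exists."""
    seq_states = states.setdefault(request_id, {})
    if seq_id not in seq_states:
        if seq_states:
            # A sibling of the same request already processed the prompt:
            # copy its state so the new sample continues from the same point.
            parent = next(iter(seq_states.values()))
            seq_states[seq_id] = copy.deepcopy(parent)
        else:
            # First sequence of this request: start from an empty state.
            seq_states[seq_id] = [0.0] * state_size
    return seq_states[seq_id]


states: StateMap = {}
prompt_state = state_for(states, "req-0", seq_id=0, state_size=4)
prompt_state[0] = 1.5  # pretend the prefill step updated the state
forked = state_for(states, "req-0", seq_id=1, state_size=4)
forked[1] = 2.0        # decoding the fork must not disturb its sibling
```

The deep copy matters here: sharing one state object between samples would let the decoding steps of one sample overwrite the recurrent state of the other.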

tlrmchlsmth

The PR looks good to me as well.

ErezSC42 and others added 28 commits April 16, 2024 10:13

dtype (vllm-project#6)

00bce1f

Co-authored-by: Mor Zusman <[email protected]>

After merge fixes

30e6dcd

Clean up

5c0efdc

Add release mamba cache to executor_base

19f11f3

Add jamba modifications

1fb817a

Add minimun 1 attention layer

30ae4a1

More fixes

7bd9c0a

Delete mamba cache

d5ac8e8

Jamba padding to the left

60b49b5

Clean up

c583fe8

Add import

c951b7d

Another clean up

da6d0f2

Align to main

eb79923

Fix reduce

919edba

Another fix

4668566

Black format for jamba

11a0737

Formatting

7e3415e

Formatting with format.sh

adbd2ae

Adding to docs and more

6daf2a2

Add to readme

7ee927b

Adding comments for prefill mamba

87fa299

Formating

8bca3b6

mzusman mentioned this pull request Apr 16, 2024

[New Model]: Jamba (MoE Mamba from AI21) #3690

Open

Mor Zusman added 2 commits June 27, 2024 15:18

Ignore jamba test in cpu

cd9ba35

Cleanup

6df4f69

tomeras91 reviewed Jun 27, 2024


Format and rename

75dd84e

tomeras91 reviewed Jun 27, 2024


Format

577f678

mzusman marked this pull request as ready for review June 27, 2024 14:06

tlrmchlsmth reviewed Jun 27, 2024


simon-mo mentioned this pull request Jun 28, 2024

v0.5.1 Release Tracker #5806

Open

Mor Zusman added 10 commits June 30, 2024 00:54

change num_layers to num_attention_layers and add comment

7bb332e

Extended the finished reqeusts ids comment

c051758

Format and make the jamba code more readable, adding comments and explicitly declare vars

b6dc237

Merge branch 'gh-main' into jamba-support-pr

24b4bf2

Format

b0b0836

Resolve conflicts and format

e52e4d7

Add finished requests ids to the prepare model spec decoding

b4d49e0

Format

68e27de

Test cleanup

670ff3a

Add message to test

b7e31e3

cadedaniel approved these changes Jul 1, 2024


Mor Zusman added 4 commits July 1, 2024 12:11

Add docstring in vllm/config.py

571f63d

rename flush to get_and_reset

49da326

Add comments

688732e

Change to private and check finished through all of the queue

4a6b170

tlrmchlsmth approved these changes Jul 1, 2024


Mor Zusman added 2 commits July 1, 2024 18:07

CI

2047a91

Merge branch 'gh-main' into jamba-support-pr

f2c407f
