[MoE/ZeRO] Moe refactor with zero refactor #5821

Hz188 · 2024-06-14T10:13:47Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs
I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

* cherry pick from refractor-moe branch
* tests passed
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* support ep + zero

---------

Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>


* [zero] refactor low level optimizer
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Fix/Example] Fix Llama Inference Loading Data Type (#5763)
* [fix/example] fix llama inference loading dtype
* revise loading dtype of benchmark llama3
* [release] update version (#5752)
* [release] update version
* [devops] update compatibility test
* [devops] update compatibility test
* [devops] update compatibility test
* [devops] update compatibility test
* [test] fix ddp plugin test
* [test] fix gptj and rpc test
* [devops] fix cuda ext compatibility
* [inference] fix flash decoding test
* [inference] fix flash decoding test
* fix (#5765)
* [test] Fix/fix testcase (#5770)
* [fix] branch for fix testcase;
* [fix] fix test_analyzer & test_auto_parallel;
* [fix] remove local change about moe;
* [fix] rm local change moe;
* [Hotfix] Add missing init file in inference.executor (#5774)
* [CI/tests] simplify some test case to reduce testing time (#5755)
* [ci/tests] simplify some test case to reduce testing time
* [ci/tests] continue to remove test case to reduce ci time cost
* restore some test config
* [ci/tests] continue to reduce ci time cost
* [misc] update dockerfile (#5776)
* [misc] update dockerfile
* [misc] update dockerfile
* [devops] fix docker ci (#5780)
* [Inference]Add Streaming LLM (#5745)
* Add Streaming LLM
* add some parameters to llama_generation.py
* verify streamingllm config
* add test_streamingllm.py
* modified according to the opinions of review
* add Citation
* change _block_tables tolist
* [hotfix] fix llama flash attention forward (#5777)
* [misc] Accelerate CI for zero and dist optim (#5758)
* remove fp16 from lamb
* remove d2h copy in checking states

---------

Co-authored-by: Edenzzzz <[email protected]>

* [Test/CI] remove test cases to reduce CI duration (#5753)
* [test] smaller gpt2 test case
* [test] reduce test cases: tests/test_zero/test_gemini/test_zeroddp_state_dict.py
* [test] reduce test cases: tests/test_zero/test_gemini/test_grad_accum.py
* [test] reduce test cases tests/test_zero/test_gemini/test_optim.py
* Revert "[test] smaller gpt2 test case" Some tests might depend on the size of model (num of chunks) This reverts commit df705a5.
* [test] reduce test cases: tests/test_checkpoint_io/test_gemini_checkpoint_io.py
* [CI] smaller test model for two mwo the two modifid cases
* [CI] hardcode gpt model for tests/test_zero/test_gemini/test_search.py since we need a fixed answer there
* [hotfix] fix testcase in test_fx/test_tracer (#5779)
* [fix] branch for fix testcase;
* [fix] fix test_analyzer & test_auto_parallel;
* [fix] remove local change about moe;
* [fix] rm local change moe;
* [fix] fix test_deepfm_model & test_dlrf_model；
* [fix] fix test_hf_albert & test_hf_gpt;
* [gemini] optimize reduce scatter d2h copy (#5760)
* [gemini] optimize reduce scatter d2h copy
* [fix] fix missing reduce variable
* [refactor] remove legacy async reduce scatter code
* [gemini] missing sync
* Revert "[refactor] remove legacy async reduce scatter code" This reverts commit 58ad76d.
* [gemini] further optimize with async all reduce
* [fix] pass flag from manager to chunk
* Allow building cuda extension without a device. (#5535) Added FORCE_CUDA environment variable support, to enable building extensions where a GPU device is not present but cuda libraries are.
* [misc] fix dist logger (#5782)
* [install]fix setup (#5786)
* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [misc] update requirements (#5787)
* [shardformer] fix import (#5788)
* upgrade colossal-chat support tp_group>1, add sp for sft
* upgrade ppo dpo rm script
* run pre-commit
* moupdate ci tests, st ci test cases passed, tp failed in generation for ppo, sp is buggy
* fix training script
* fix ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix transformers version
* remove duplicated test
* fix datasets version
* remove models that require huggingface auth from ci
* remove local data path
* update ci
* remove baichuan from template test due to transformer version conflict
* merge
* Refactor modeling by adding attention backend
  Signed-off-by: char-1ee <[email protected]>
* Fix tests and naming
  Signed-off-by: char-1ee <[email protected]>
* Pass inference model shard configs for module init
  Signed-off-by: char-1ee <[email protected]>
* Clean up
  Signed-off-by: char-1ee <[email protected]>
* replace the customized dataloader setup with the build-in one
* replace the customized dataloader setup with the build-in one
* Remove flash attention backend
  Signed-off-by: char-1ee <[email protected]>
* fix readme
* Fix test import
  Signed-off-by: char-1ee <[email protected]>
* update sft trainning script
* [Inference]refactor baichuan (#5791)
* refactor baichuan
* remove unused code and add TODO for lazyinit
* [test] fix chatglm test kit (#5793)
* [shardformer] fix modeling of bloom and falcon (#5796)
* [test] fix qwen2 pytest distLarge (#5797)
* [Inference] Fix flash-attn import and add model test (#5794)
* Fix torch int32 dtype
  Signed-off-by: char-1ee <[email protected]>
* Fix flash-attn import
  Signed-off-by: char-1ee <[email protected]>
* Add generalized model test
  Signed-off-by: char-1ee <[email protected]>
* Remove exposed path to model
  Signed-off-by: char-1ee <[email protected]>
* Add default value for use_flash_attn
  Signed-off-by: char-1ee <[email protected]>
* Rename model test
  Signed-off-by: char-1ee <[email protected]>

---------

Signed-off-by: char-1ee <[email protected]>

* [Gemini] Use async stream to prefetch and h2d data moving (#5781)
* use async stream to prefetch and h2d data moving
* Remove redundant code
* [gemini] quick fix on possible async operation (#5803)
* [gemini] quick fix on possible async operation
* [gemini] quick fix on possible async operation
* [shardformer] upgrade transformers to 4.39.3 (#5815)
* [shardformer]upgrade transformers for gpt2/gptj/whisper (#5807)
* [shardformer] fix modeling of gpt2 and gptj
* [shardformer] fix whisper modeling
* [misc] update requirements

---------

Co-authored-by: ver217 <[email protected]>

* [shardformer]upgrade transformers for mistral (#5808)
* upgrade transformers for mistral
* fix
* fix
* [shardformer]upgrade transformers for llama (#5809)
* update transformers fix
* fix
* fix
* [inference] upgrade transformers (#5810)
* update transformers fix
* fix
* fix
* fix
* fix
* [gemini] update transformers for gemini (#5814)

---------

Co-authored-by: ver217 <[email protected]>

* Support 4d parallel + flash attention (#5789)
* support tp + sp + pp
* remove comments

---------

Co-authored-by: Edenzzzz <[email protected]>

---------

Signed-off-by: char-1ee <[email protected]>
Co-authored-by: Yuanheng Zhao <[email protected]>
Co-authored-by: Hongxin Liu <[email protected]>
Co-authored-by: flybird11111 <[email protected]>
Co-authored-by: duanjunwen <[email protected]>
Co-authored-by: yuehuayingxueluo <[email protected]>
Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: botbw <[email protected]>
Co-authored-by: Charles Coulombe <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: YeAnbang <[email protected]>
Co-authored-by: char-1ee <[email protected]>
Co-authored-by: Runyu Lu <[email protected]>
Co-authored-by: YeAnbang <[email protected]>
Co-authored-by: Guangyao Zhang <[email protected]>

colossalai/zero/low_level/bookkeeping/gradient_store.py

colossalai/zero/low_level/low_level_optim.py

colossalai/booster/plugin/moe_hybrid_parallel_plugin.py

colossalai/cluster/process_group_mesh.py

colossalai/checkpoint_io/moe_checkpoint.py

tests/test_moe/test_moe_checkpoint.py

FrankLeeeee and others added 30 commits May 29, 2024 16:39

[moe] removed openmoe-coupled code and rectify mixstral code (#5471)

f1d4167

add mixtral auto policy & move pipeline forward code to modeling folder

d49fd63

[moe refactor] modify kernel test without Route Class

d2e07fc

[moe refactor] add moe tensor test path environment variable to github workflow

7556b8f

fix typos

16329d5

fix moe test bug due to the code rebase

b934437

[moe refactor] fix moe zero test, and little bug in low level zero

a792e83

fix typo

d203ba8

add moe tensor path to github workflow

55c7416

remove some useless code

8915e9d

fix typo & unify global variable XX_AXIS logic without using -1

7963fb0

fix typo & prettifier the code

32ced74

remove print code & support zero 2 test

3100c1b

remove useless code

928ee39

reanme function

6dc0cfc

fix typo

4417840

fix typo

eb35655

Further improve the test code

d1d446b

remove print code

09a5188

[moe refactor] change test model from fake moe model to mixtral moe layer and remove useless test

4c6ea42

[moe refactor] skip some unit test which will be refactored later

80b6586

[moe refactor] fix unit import error

7d06220

[moe refactor] fix circular import issues

fb41f42

[moe refactor] remove debug code

e99b69c

[moe refactor] update github workflow

af9ade6

Merge pull request #5775 from Hz188/feature/moe

49d74f3

[Feauture] MoE refactor

[Feature] MoE refactor with newest version of ZeRO (#5801)

88f318a

[zero] remove redundant members in BucketStore (#5802)

b2ac7e5

botbw and others added 7 commits June 17, 2024 17:08

[zero] fix missing hook removal (#5824)

4cd4a1f

[zero] fix hook bug

d9ea6d4

Merge branch 'main' into feature/moe

b04e99c

[zero] add low level optimizer back (#5839)

62cd25d

* [zero] fix param & refactor
* [zero] add back original low level opt
* [zero] remove moe related
* [zero] pass zero tests
* [zero] refactor
* [chore] add del func back

[zero] comments and naming (#5840)

204d25c

[zero] modify api (#5843)

efdfa06

* [zero] modify api
* [test] remove _grad_store access in tests

Hz188 self-assigned this Jun 25, 2024

botbw and others added 3 commits June 26, 2024 11:08

[test] fix (#5857)

44aeccc

[CI] skip openmoe CI check

9398484

[CI] fox pre-commit

5e551f8

ver217 reviewed Jun 27, 2024

colossalai/zero/low_level/bookkeeping/gradient_store.py (resolved)

colossalai/zero/low_level/low_level_optim.py (outdated, resolved)

[zero] remove redundant memebr init (#5862)

2ff332c

ver217 reviewed Jun 27, 2024

colossalai/checkpoint_io/moe_checkpoint.py (outdated, resolved)

colossalai/checkpoint_io/moe_checkpoint.py (outdated, resolved)

ver217 reviewed Jun 27, 2024

tests/test_moe/test_moe_checkpoint.py (outdated, resolved)

Hz188 and others added 4 commits June 27, 2024 08:52

[misc] remove useless code, modify the pg mesh implementation

75be843

Merge branch 'hpcaitech:feature/moe' into feature/moe

1855442

[misc] remove useless code, modify the pg mesh implementation

3a25166

[misc] use tempfile

502e514

Hz188 force-pushed the feature/moe branch from b606612 to 502e514 Compare June 27, 2024 10:27

Hz188 added 3 commits June 27, 2024 11:49

resolve conflict with main branch

494b8a2

resolve conflict with main branch

961e96f

[misc] use tempfile in test_moe_checkpoint.py

95c4c0b

Hz188 changed the title from "[MoE/ZeRO] Moe refactor with newest version of low level zero" to "[MoE/ZeRO] Moe refactor with zero refactor" on Jun 27, 2024

Hz188 added 2 commits June 28, 2024 03:47

[misc] remove useless code, add assertion about sequence parallel, move logger into function

9e966b9

[misc] remove useless code

165e894

ver217 approved these changes Jun 28, 2024

ver217 merged commit 416580b into main Jun 28, 2024
7 checks passed

ver217 deleted the feature/moe branch June 28, 2024 06:06
