[MoE/ZeRO] MoE refactor with ZeRO refactor #5821
Commits on May 29, 2024
- f1d4167
- df6826d: [Feature] MoE refactor; integration with Mixtral (#5682)
  * cherry-pick from the refactor-moe branch
  * tests passed
  * [pre-commit.ci] auto fixes from pre-commit.com hooks
  * support ep + zero
  (co-authored with Edenzzzz and pre-commit-ci[bot])
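The Mixtral integration above revolves around a top-k gated MoE layer: a router scores every expert per token, only the top-k experts run, and their outputs are combined with renormalized softmax weights. A minimal NumPy sketch of that routing step (purely illustrative; `moe_forward` and all names here are invented, not ColossalAI APIs):

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route each token to its top-k experts, Mixtral-style (illustrative sketch).

    x:        (tokens, d_model) input activations
    gate_w:   (d_model, n_experts) router weights
    experts:  list of callables, one per expert
    """
    logits = x @ gate_w                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of the k largest logits
    # Softmax over only the selected experts, as in Mixtral.
    sel = np.take_along_axis(logits, top, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                      # combine the chosen experts' outputs
        for k in range(top_k):
            out[t] += w[t, k] * experts[top[t, k]](x[t])
    return out
```

With `top_k` equal to the number of experts and identity experts, the output reduces to the input, which is a handy sanity check that the routing weights are normalized.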
Commits on May 31, 2024
- d49fd63
Commits on Jun 4, 2024
- d2e07fc
- 7556b8f
- 16329d5
Commits on Jun 5, 2024
- b934437
Commits on Jun 6, 2024
- a792e83
- d203ba8
- 55c7416
- 8915e9d
Commits on Jun 7, 2024
- 7963fb0
- 32ced74
- 3100c1b
- 928ee39
- 6dc0cfc
- 4417840
- eb35655
- d1d446b
- 09a5188
Commits on Jun 11, 2024
- 4c6ea42: [moe refactor] change test model from fake moe model to mixtral moe layer and remove useless test
- 80b6586
- 7d06220
- fb41f42
- e99b69c
Commits on Jun 12, 2024
- af9ade6
- 49d74f3: Merge pull request #5775 from Hz188/feature/moe ([Feature] MoE refactor)
- d71ab10: [moe/zero] refactor low level optimizer (#5767)
  * [zero] refactor low level optimizer
  * [pre-commit.ci] auto fixes from pre-commit.com hooks
  (co-authored with pre-commit-ci[bot])
- 88f318a
- b2ac7e5
Commits on Jun 13, 2024
- 346a0df
Commits on Jun 14, 2024
- a3a7d7d: Merge pull request #5811 from botbw/moe ([zero] remove redundant members in BucketStore)
- ba0115a: [MoE/ZeRO] Update MoeHybridParallelPlugin with refactored ZeRO and fix ZeRO bug (#5819)
  * [moe refactor] update unit test with the refactored ZeRO and remove useless test
  * move moe checkpoint to checkpoint folder and exchange global axis to class member
  * update moe hybrid parallel plugin with newest version of zero & fix zero working/master params bug
  * fix zero unit test
  * add an assertion to prevent users from using it incorrectly
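The working/master parameter bug mentioned above lives in the standard mixed-precision ZeRO split: low-precision "working" parameters are shared by all ranks, while each rank owns an fp32 "master" copy of only its own shard and writes it back after the optimizer step. A toy single-process sketch of that flow (hypothetical class, no real torch.distributed; the "all-gather" is just a concatenation):

```python
import numpy as np

class ShardedMasterParams:
    """Toy ZeRO-style master/working parameter split, simulated in one process."""

    def __init__(self, working, world_size):
        self.working = working                         # shared fp16 "working" params
        self.world_size = world_size
        # Each "rank" keeps an fp32 master copy of only its own shard.
        self.shards = np.array_split(working.astype(np.float32), world_size)

    def step(self, grads, lr=0.1):
        g_shards = np.array_split(grads.astype(np.float32), self.world_size)
        for rank in range(self.world_size):            # each rank updates only its shard
            self.shards[rank] -= lr * g_shards[rank]
        # "all-gather": rebuild the full working params from the fp32 master shards
        self.working[:] = np.concatenate(self.shards).astype(np.float16)
```

Keeping the update in fp32 and down-casting only at the write-back is what makes the working and master copies drift-free; the bug class fixed here is exactly a mismatch between those two views.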
Commits on Jun 17, 2024
- a10802e: [hotfix] Solve the compatibility issue of the ZeRO refactor (#5823)
  * modify function parameter names to resolve compatibility issues
- 4cd4a1f
Commits on Jun 19, 2024
- 729388e: [MoE] Resolve .github conflict (#5829), a merge bringing in main-branch changes, including:
  * [Fix/Example] fix Llama inference loading data type (#5763)
  * [release] update version (#5752)
  * fix (#5765)
  * [test] fix testcase (#5770)
  * [Hotfix] add missing init file in inference.executor (#5774)
  * [CI/tests] simplify some test cases to reduce testing time (#5755)
  * [misc] update dockerfile (#5776)
  * [devops] fix docker ci (#5780)
  * [Inference] add Streaming LLM (#5745)
  * [hotfix] fix llama flash attention forward (#5777)
  * [misc] accelerate CI for zero and dist optim (#5758)
  * [Test/CI] remove test cases to reduce CI duration (#5753)
  * [hotfix] fix testcase in test_fx/test_tracer (#5779)
  * [gemini] optimize reduce-scatter d2h copy (#5760)
  * allow building the CUDA extension without a device via FORCE_CUDA (#5535)
  * [misc] fix dist logger (#5782)
  * [install] fix setup (#5786)
  * [misc] update requirements (#5787)
  * [shardformer] fix import (#5788)
  * ColossalChat upgrades: support tp_group > 1, add sequence parallelism for SFT, update PPO/DPO/RM scripts, fix CI
  * refactor inference modeling with a generalized attention backend
  * [Inference] refactor baichuan (#5791)
  * [test] fix chatglm test kit (#5793)
  * [shardformer] fix modeling of bloom and falcon (#5796)
  * [test] fix qwen2 pytest distLarge (#5797)
  * [Inference] fix flash-attn import and add model test (#5794)
  * [Gemini] use async stream to prefetch and h2d data moving (#5781)
  * [gemini] quick fix on possible async operation (#5803)
  * [shardformer] upgrade transformers to 4.39.3 (#5815), covering gpt2/gptj/whisper (#5807), mistral (#5808), llama (#5809), inference (#5810), and gemini (#5814)
  * support 4d parallel + flash attention (#5789)
  (co-authored with Yuanheng Zhao, Hongxin Liu, flybird11111, duanjunwen, yuehuayingxueluo, Edenzzzz, botbw, Charles Coulombe, pre-commit-ci[bot], YeAnbang, char-1ee, Runyu Lu, and Guangyao Zhang)
- d9ea6d4
- b04e99c
Commits on Jun 20, 2024
- 62cd25d: [zero] add low level optimizer back (#5839)
  * [zero] fix param & refactor
  * [zero] add back original low level opt
  * [zero] remove moe related
  * [zero] pass zero tests
  * [zero] refactor
  * [chore] add del func back
- 204d25c
- efdfa06:
  * [zero] modify api
  * [test] remove _grad_store access in tests
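The BucketStore cleanup and the restored low-level optimizer both center on bucketed gradient reduction: gradients are appended to a fixed-capacity flat bucket, and a (usually asynchronous) all-reduce is launched whenever the bucket fills, instead of reducing tensor by tensor. A schematic, single-process sketch (`GradBucket` and all names here are invented for illustration):

```python
class GradBucket:
    """Toy gradient bucket: flush (i.e. 'reduce') once the size cap is reached."""

    def __init__(self, capacity, reduce_fn):
        self.capacity = capacity
        self.reduce_fn = reduce_fn      # stand-in for launching an all-reduce
        self.items, self.size = [], 0

    def add(self, name, numel):
        # Flush the current batch first if this gradient would overflow the bucket.
        if self.size + numel > self.capacity:
            self.flush()
        self.items.append(name)
        self.size += numel

    def flush(self):
        if self.items:
            self.reduce_fn(list(self.items))
        self.items, self.size = [], 0
```

Batching many small gradients into one communication call is the point: the final explicit `flush()` at the end of backward drains whatever is left in the bucket.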
Commits on Jun 26, 2024
- 44aeccc
- 9398484
- 5e551f8
Commits on Jun 27, 2024
- 2ff332c
- 75be843
- 1855442
- 3a25166
- 502e514
- 494b8a2
- 961e96f
- 95c4c0b
Commits on Jun 28, 2024
- 9e966b9: [misc] remove useless code, add assertion about sequence parallel, move logger into function
- 165e894