
[MoE/ZeRO] Moe refactor with zero refactor #5821

Merged · 54 commits · Jun 28, 2024

Commits on May 29, 2024

  1. f1d4167
  2. [Feature] MoE refactor; Integration with Mixtral (#5682)

    * cherry pick from refractor-moe branch
    
    * tests passed
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    * support ep + zero
    
    ---------
    
    Co-authored-by: Edenzzzz <[email protected]>
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    3 people authored and ver217 committed May 29, 2024
    df6826d
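
Note: the "support ep + zero" item above combines expert parallelism (EP) with ZeRO. As a minimal sketch of the general pattern (illustrative only, not code from this PR), EP process groups can be carved out of the world like this, leaving ZeRO to shard optimizer states over the remaining data-parallel ranks:

```python
import torch.distributed as dist

def build_ep_groups(world_size: int, ep_size: int):
    """Partition ranks into contiguous EP groups of size ep_size."""
    rank_to_group = {}
    for start in range(0, world_size, ep_size):
        ranks = list(range(start, start + ep_size))
        # new_group is collective: every rank must execute every call
        group = dist.new_group(ranks)
        for r in ranks:
            rank_to_group[r] = group
    return rank_to_group

# each rank then picks its own group, e.g.:
# ep_group = build_ep_groups(dist.get_world_size(), ep_size=2)[dist.get_rank()]
```

Expert parameters replicated only within an EP group then need their own, smaller ZeRO shard group, which is the bookkeeping that later commits in this PR deal with.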

Commits on May 31, 2024

  1. d49fd63

Commits on Jun 4, 2024

  1. d2e07fc
  2. 7556b8f
  3. fix typos

    Hz188 committed Jun 4, 2024
    16329d5

Commits on Jun 5, 2024

  1. b934437

Commits on Jun 6, 2024

  1. a792e83
  2. fix typo

    Hz188 committed Jun 6, 2024
    d203ba8
  3. 55c7416
  4. remove some useless code

    Hz188 committed Jun 6, 2024
    8915e9d

Commits on Jun 7, 2024

  1. 7963fb0
  2. fix typo & prettify the code

    Hz188 committed Jun 7, 2024
    32ced74
  3. 3100c1b
  4. remove useless code

    Hz188 committed Jun 7, 2024
    928ee39
  5. rename function

    Hz188 committed Jun 7, 2024
    6dc0cfc
  6. fix typo

    Hz188 committed Jun 7, 2024
    4417840
  7. fix typo

    Hz188 committed Jun 7, 2024
    eb35655
  8. Further improve the test code

    Hz188 committed Jun 7, 2024
    d1d446b
  9. remove print code

    Hz188 committed Jun 7, 2024
    09a5188

Commits on Jun 11, 2024

  1. [moe refactor] change test model from fake moe model to mixtral moe layer and remove useless test

    Hz188 committed Jun 11, 2024
    4c6ea42
  2. 80b6586
  3. 7d06220
  4. fb41f42
  5. e99b69c

Commits on Jun 12, 2024

  1. af9ade6
  2. Merge pull request #5775 from Hz188/feature/moe

    [Feature] MoE refactor
    botbw committed Jun 12, 2024
    49d74f3
  3. [moe/zero] refactor low level optimizer (#5767)

    * [zero] refactor low level optimizer
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    botbw and pre-commit-ci[bot] committed Jun 12, 2024
    d71ab10
  4. 88f318a
  5. b2ac7e5

Commits on Jun 13, 2024

  1. 346a0df

Commits on Jun 14, 2024

  1. Merge pull request #5811 from botbw/moe

    [zero] remove redundant members in BucketStore
    botbw committed Jun 14, 2024
    a3a7d7d
  2. [Moe/Zero] Update MoeHybridParallelPlugin with refactored ZeRO and Fix Zero bug (#5819)
    
    * [moe refactor] update unit test with the refactored ZeRO and remove useless test
    
    * move moe checkpoint to the checkpoint folder and change the global axis into a class member
    
    * update moe hybrid parallel plugin with newest version of zero & fix zero working/master params bug
    
    * fix zero unit test
    
    * Add an assertion to prevent users from using it incorrectly
    Hz188 committed Jun 14, 2024
    ba0115a
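
Note: "working/master params" in the commit above refers to the usual mixed-precision ZeRO arrangement: a low-precision working copy used in forward/backward plus an fp32 master copy that the optimizer updates. A toy sketch of the pattern (an illustration of the general technique, not ColossalAI's internals):

```python
import torch

# fp16 "working" parameter used for forward/backward
working = torch.nn.Parameter(torch.randn(1024).half())
# fp32 "master" copy that the optimizer actually updates
# (in real ZeRO this copy is sharded across data-parallel ranks)
master = working.detach().clone().float()

def step(lr: float = 1e-3) -> None:
    grad32 = working.grad.float()        # upcast the fp16 gradient
    master.add_(grad32, alpha=-lr)       # SGD-style update in fp32
    working.data.copy_(master.half())    # write the result back to the working copy

working.grad = torch.randn(1024).half()  # stand-in for a real backward pass
step()
```

Bugs in this area typically come down to the two copies drifting out of sync, which matches the "fix zero working/master params bug" wording above.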

Commits on Jun 17, 2024

  1. [hotfix]Solve the compatibility issue of zero refactor (#5823)

    * [moe refactor] update unit test with the refactored ZeRO and remove useless test
    
    * move moe checkpoint to the checkpoint folder and change the global axis into a class member
    
    * update moe hybrid parallel plugin with newest version of zero & fix zero working/master params bug
    
    * fix zero unit test
    
    * Add an assertion to prevent users from using it incorrectly
    
    * Modify function parameter names to resolve compatibility issues
    Hz188 committed Jun 17, 2024
    a10802e
  2. 4cd4a1f

Commits on Jun 19, 2024

  1. [MoE] Resolve .github conflict (#5829)

    * [Fix/Example] Fix Llama Inference Loading Data Type (#5763)
    
    * [fix/example] fix llama inference loading dtype
    
    * revise loading dtype of benchmark llama3
    
    * [release] update version (#5752)
    
    * [release] update version
    
    * [devops] update compatibility test
    
    * [devops] update compatibility test
    
    * [devops] update compatibility test
    
    * [devops] update compatibility test
    
    * [test] fix ddp plugin test
    
    * [test] fix gptj and rpc test
    
    * [devops] fix cuda ext compatibility
    
    * [inference] fix flash decoding test
    
    * [inference] fix flash decoding test
    
    * fix (#5765)
    
    * [test] Fix/fix testcase (#5770)
    
    * [fix] branch for fix testcase;
    
    * [fix] fix test_analyzer & test_auto_parallel;
    
    * [fix] remove local change about moe;
    
    * [fix] rm local change moe;
    
    * [Hotfix] Add missing init file in inference.executor (#5774)
    
    * [CI/tests] simplify some test case to reduce testing time (#5755)
    
    * [ci/tests] simplify some test case to reduce testing time
    
    * [ci/tests] continue to remove test case to reduce ci time cost
    
    * restore some test config
    
    * [ci/tests] continue to reduce ci time cost
    
    * [misc] update dockerfile (#5776)
    
    * [misc] update dockerfile
    
    * [misc] update dockerfile
    
    * [devops] fix docker ci (#5780)
    
    * [Inference]Add Streaming LLM (#5745)
    
    * Add Streaming LLM
    
    * add some parameters to llama_generation.py
    
    * verify streamingllm config
    
    * add test_streamingllm.py
    
    * modified according to review comments
    
    * add Citation
    
    * change _block_tables tolist
    
    * [hotfix] fix llama flash attention forward (#5777)
    
    * [misc] Accelerate CI for zero and dist optim (#5758)
    
    * remove fp16 from lamb
    
    * remove d2h copy in checking states
    
    ---------
    
    Co-authored-by: Edenzzzz <[email protected]>
    
    * [Test/CI] remove test cases to reduce CI duration (#5753)
    
    * [test] smaller gpt2 test case
    
    * [test] reduce test cases: tests/test_zero/test_gemini/test_zeroddp_state_dict.py
    
    * [test] reduce test cases: tests/test_zero/test_gemini/test_grad_accum.py
    
    * [test] reduce test cases tests/test_zero/test_gemini/test_optim.py
    
    * Revert "[test] smaller gpt2 test case"
    
    Some tests might depend on the size of the model (number of chunks)
    
    This reverts commit df705a5.
    
    * [test] reduce test cases: tests/test_checkpoint_io/test_gemini_checkpoint_io.py
    
    * [CI] smaller test model for the two modified cases
    
    * [CI] hardcode gpt model for tests/test_zero/test_gemini/test_search.py since we need a fixed answer there
    
    * [hotfix] fix testcase in test_fx/test_tracer (#5779)
    
    * [fix] branch for fix testcase;
    
    * [fix] fix test_analyzer & test_auto_parallel;
    
    * [fix] remove local change about moe;
    
    * [fix] rm local change moe;
    
    * [fix] fix test_deepfm_model & test_dlrf_model;
    
    * [fix] fix test_hf_albert & test_hf_gpt;
    
    * [gemini] optimize reduce scatter d2h copy (#5760)
    
    * [gemini] optimize reduce scatter d2h copy
    
    * [fix] fix missing reduce variable
    
    * [refactor] remove legacy async reduce scatter code
    
    * [gemini] missing sync
    
    * Revert "[refactor] remove legacy async reduce scatter code"
    
    This reverts commit 58ad76d.
    
    * [gemini] further optimize with async all reduce
    
    * [fix] pass flag from manager to chunk
    
    * Allow building cuda extension without a device. (#5535)
    
    Added FORCE_CUDA environment variable support to enable building extensions where a GPU device is not present but CUDA libraries are.
    
    * [misc] fix dist logger (#5782)
    
    * [install]fix setup (#5786)
    
    * fix
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    
    * [misc] update requirements (#5787)
    
    * [shardformer] fix import (#5788)
    
    * upgrade colossal-chat to support tp_group>1, add sp for sft
    
    * upgrade ppo dpo rm script
    
    * run pre-commit
    
    * update ci tests, most ci test cases passed, tp failed in generation for ppo, sp is buggy
    
    * fix training script
    
    * fix ci
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    * fix transformers version
    
    * remove duplicated test
    
    * fix datasets version
    
    * remove models that require huggingface auth from ci
    
    * remove local data path
    
    * update ci
    
    * remove baichuan from template test due to transformer version conflict
    
    * merge
    
    * Refactor modeling by adding attention backend
    
    Signed-off-by: char-1ee <[email protected]>
    
    * Fix tests and naming
    
    Signed-off-by: char-1ee <[email protected]>
    
    * Pass inference model shard configs for module init
    
    Signed-off-by: char-1ee <[email protected]>
    
    * Clean up
    
    Signed-off-by: char-1ee <[email protected]>
    
    * replace the customized dataloader setup with the built-in one
    
    * replace the customized dataloader setup with the built-in one
    
    * Remove flash attention backend
    
    Signed-off-by: char-1ee <[email protected]>
    
    * fix readme
    
    * Fix test import
    
    Signed-off-by: char-1ee <[email protected]>
    
    * update sft training script
    
    * [Inference]refactor baichuan (#5791)
    
    * refactor baichuan
    
    * remove unused code and add TODO for lazyinit
    
    * [test] fix chatglm test kit (#5793)
    
    * [shardformer] fix modeling of bloom and falcon (#5796)
    
    * [test] fix qwen2 pytest distLarge (#5797)
    
    * [Inference] Fix flash-attn import and add model test (#5794)
    
    * Fix torch int32 dtype
    
    Signed-off-by: char-1ee <[email protected]>
    
    * Fix flash-attn import
    
    Signed-off-by: char-1ee <[email protected]>
    
    * Add generalized model test
    
    Signed-off-by: char-1ee <[email protected]>
    
    * Remove exposed path to model
    
    Signed-off-by: char-1ee <[email protected]>
    
    * Add default value for use_flash_attn
    
    Signed-off-by: char-1ee <[email protected]>
    
    * Rename model test
    
    Signed-off-by: char-1ee <[email protected]>
    
    ---------
    
    Signed-off-by: char-1ee <[email protected]>
    
    * [Gemini] Use async stream to prefetch and h2d data moving (#5781)
    
    * use async stream to prefetch and h2d data moving
    
    * Remove redundant code
    
    * [gemini] quick fix on possible async operation (#5803)
    
    * [gemini] quick fix on possible async operation
    
    * [gemini] quick fix on possible async operation
    
    * [shardformer] upgrade transformers to 4.39.3 (#5815)
    
    * [shardformer]upgrade transformers for gpt2/gptj/whisper (#5807)
    
    * [shardformer] fix modeling of gpt2 and gptj
    
    * [shardformer] fix whisper modeling
    
    * [misc] update requirements
    
    ---------
    
    Co-authored-by: ver217 <[email protected]>
    
    * [shardformer]upgrade transformers for mistral (#5808)
    
    * upgrade transformers for mistral
    
    * fix
    
    * fix
    
    * [shardformer]upgrade transformers for llama (#5809)
    
    * update transformers
    
    fix
    
    * fix
    
    * fix
    
    * [inference] upgrade transformers (#5810)
    
    * update transformers
    
    fix
    
    * fix
    
    * fix
    
    * fix
    
    * fix
    
    * [gemini] update transformers for gemini (#5814)
    
    ---------
    
    Co-authored-by: ver217 <[email protected]>
    
    * Support 4d parallel + flash attention (#5789)
    
    * support tp + sp + pp
    
    * remove comments
    
    ---------
    
    Co-authored-by: Edenzzzz <[email protected]>
    
    ---------
    
    Signed-off-by: char-1ee <[email protected]>
    Co-authored-by: Yuanheng Zhao <[email protected]>
    Co-authored-by: Hongxin Liu <[email protected]>
    Co-authored-by: flybird11111 <[email protected]>
    Co-authored-by: duanjunwen <[email protected]>
    Co-authored-by: yuehuayingxueluo <[email protected]>
    Co-authored-by: Edenzzzz <[email protected]>
    Co-authored-by: Edenzzzz <[email protected]>
    Co-authored-by: botbw <[email protected]>
    Co-authored-by: Charles Coulombe <[email protected]>
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: YeAnbang <[email protected]>
    Co-authored-by: char-1ee <[email protected]>
    Co-authored-by: Runyu Lu <[email protected]>
    Co-authored-by: YeAnbang <[email protected]>
    Co-authored-by: Guangyao Zhang <[email protected]>
    16 people committed Jun 19, 2024
    729388e
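
Note: among the changes folded into this merge is "[Gemini] Use async stream to prefetch and h2d data moving (#5781)". As a rough illustration of that technique (not the PR's implementation), a host-to-device prefetch on a side CUDA stream looks like this:

```python
import torch

prefetch_stream = torch.cuda.Stream()

def prefetch_h2d(host_chunk: torch.Tensor) -> torch.Tensor:
    # async copies only overlap with compute if the host buffer is pinned
    assert host_chunk.is_pinned(), "H2D prefetch needs page-locked memory"
    with torch.cuda.stream(prefetch_stream):
        dev_chunk = host_chunk.to("cuda", non_blocking=True)
    # order the default stream after the copy without blocking the host
    torch.cuda.current_stream().wait_stream(prefetch_stream)
    return dev_chunk
```

The copy only overlaps with compute when the host buffer is page-locked, hence the pinned-memory assertion.
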
  2. [zero] fix hook bug

    Hz188 committed Jun 19, 2024
    d9ea6d4
  3. b04e99c

Commits on Jun 20, 2024

  1. [zero] add low level optimizer back (#5839)

    * [zero] fix param & refactor
    
    * [zero] add back original low level opt
    
    * [zero] remove moe related
    
    * [zero] pass zero tests
    
    * [zero] refactor
    
    * [chore] add del func back
    botbw committed Jun 20, 2024
    62cd25d
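
Note: a rough usage sketch of the restored low-level ZeRO optimizer, based on the pre-refactor API. The keyword names below (partition_grad, overlap_communication) are assumptions that may have shifted in the API changes of #5843, so check the merged signature before relying on them:

```python
import torch
from colossalai.zero import LowLevelZeroOptimizer

model = torch.nn.Linear(512, 512).cuda()
base_optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

# wrap the base optimizer so optimizer states (and optionally grads) are sharded
optim = LowLevelZeroOptimizer(
    base_optim,
    partition_grad=True,         # ZeRO-2: shard gradients as well (assumed kwarg)
    overlap_communication=True,  # overlap reduce-scatter with backward (assumed kwarg)
)
```
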
  2. 204d25c
  3. [zero] modify api (#5843)

    * [zero] modify api
    
    * [test] remove _grad_store access in tests
    botbw committed Jun 20, 2024
    efdfa06

Commits on Jun 26, 2024

  1. [test] fix (#5857)

    botbw committed Jun 26, 2024
    44aeccc
  2. [CI] skip openmoe CI check

    Hz188 committed Jun 26, 2024
    9398484
  3. [CI] fix pre-commit

    Hz188 committed Jun 26, 2024
    5e551f8

Commits on Jun 27, 2024

  1. 2ff332c
  2. 75be843
  3. 1855442
  4. 3a25166
  5. [misc] use tempfile

    Hz188 committed Jun 27, 2024
    502e514
  6. 494b8a2
  7. 961e96f
  8. 95c4c0b

Commits on Jun 28, 2024

  1. [misc] remove useless code, add assertion about sequence parallel, move logger into function

    Hz188 committed Jun 28, 2024
    9e966b9
  2. [misc] remove useless code

    Hz188 committed Jun 28, 2024
    165e894