Clone the project:
git clone https://github.com/ETOgaosion/Hetaceso.git --recurse-submodules
If you forgot to clone the submodules, run:
git submodule update --init --recursive
Then start Docker and enter the container:
chmod +x script/*.sh
./script/start_docker.sh
docker exec -it hetaceso-[USERNAME] bash
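The `[USERNAME]` placeholder has to be replaced by hand. A minimal sketch of filling it in automatically, assuming (this is not confirmed by `script/start_docker.sh`) that the suffix is your login name; `container_name` is a hypothetical helper, not part of the project:

```shell
# Assumption: the start script names the container "hetaceso-<login name>".
# Pass a name explicitly to override the login-name default.
container_name() {
    # Fall back to `id -un` when $USER is unset (for example in some CI shells).
    printf 'hetaceso-%s\n' "${1:-${USER:-$(id -un)}}"
}

# Usage (not executed here):
#   docker exec -it "$(container_name)" bash
```

If the name does not match, `docker ps --filter "name=hetaceso"` lists the running containers whose names contain `hetaceso`.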
In the container, check the network:
root# curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
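A 301 redirect like the one above already shows that the container can reach the outside network. For scripted checks, the same test can be reduced to an exit code. This is only a sketch: `status_ok` and `check_network` are hypothetical helper names, and treating every 2xx or 3xx status as success is an assumption of this example.

```shell
# Treat any 2xx or 3xx status as success: both show that a response came back.
status_ok() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

check_network() {
    # -s: silent, -o /dev/null: discard the body, -m 10: give up after 10 s,
    # -w '%{http_code}': print only the status code ("000" when no response arrived).
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "${1:-http://google.com}")
    status_ok "$code"
}

# Usage (not executed here):
#   check_network && echo "network OK" || echo "no network"
```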
Install Transformer Engine in this project:
pip install -e external/TransformerEngine
Setup takes tens of minutes to finish.
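Because the build is long, it can be worth confirming afterwards that the package is visible to Python. A minimal sketch, assuming `python3` is the interpreter that `pip` installed into; `module_available` is a hypothetical helper, and `transformer_engine` is the package's import name:

```shell
# Succeeds if the given top-level Python module can be found, without importing it.
# Set PYTHON to use a different interpreter.
module_available() {
    "${PYTHON:-python3}" -c '
import importlib.util, sys
# find_spec returns None for a top-level module that is not installed.
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)
' "$1"
}

# Usage (not executed here):
#   module_available transformer_engine && echo "Transformer Engine is installed"
```

This only checks that the module can be located; it does not load the compiled extensions or touch the GPU.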
- Imbalanced assignment of DP/SP workloads
- Support for new Megatron features
- RoPE (with CP)
- MoE
- overlap
- Zero-1 (Distributed Saved Activation)
- Support for Double-CP (Ring-Ulysses)
- Support for profiling and search
- Never push directly to the dev branch; open a pull request and discuss it with the other participants
- For debugging, use a dev-[username] branch and keep it in sync with the dev branch
- For feature development, use a dev-[username]-[functionname] branch, which can be independent
- Use the Black formatter
- Try to keep a whole feature or a whole debugging session in one commit and PR, so that others can review it conveniently
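The branch conventions above could be scripted roughly as follows. This is a sketch under assumptions: the remote is called `origin`, `dev` exists on it, and the helper names (`debug_branch`, `feature_branch`, `sync_debug_branch`) as well as the example values `alice` and `rope-cp` are hypothetical.

```shell
# Build branch names that follow the naming rules.
debug_branch()   { printf 'dev-%s\n' "$1"; }
feature_branch() { printf 'dev-%s-%s\n' "$1" "$2"; }

# Switch to your debug branch (creating it from origin/dev on first use)
# and merge the latest dev into it.
sync_debug_branch() {
    branch=$(debug_branch "$1")
    git fetch origin || return 1
    git checkout "$branch" 2>/dev/null || git checkout -b "$branch" origin/dev || return 1
    git merge origin/dev
}

# Usage (not executed here):
#   sync_debug_branch alice
#   git checkout -b "$(feature_branch alice rope-cp)" origin/dev
#   black --check .    # report files Black would reformat, without changing them
```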