Commit
* add e2e test for train API Signed-off-by: helenxie-bit <[email protected]>
* fix peft import error Signed-off-by: helenxie-bit <[email protected]>
* update settings of the job Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* fix error detection Signed-off-by: helenxie-bit <[email protected]>
* resolve conflict Signed-off-by: helenxie-bit <[email protected]>
* resolve conflict Signed-off-by: helenxie-bit <[email protected]>
* resolve conflict Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* fix NoneType error Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* test bug Signed-off-by: helenxie-bit <[email protected]>
* find bug Signed-off-by: helenxie-bit <[email protected]>
* find bug Signed-off-by: helenxie-bit <[email protected]>
* find bug Signed-off-by: helenxie-bit <[email protected]>
* add storage_config Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* reduce pvc size Signed-off-by: helenxie-bit <[email protected]>
* set storage_config Signed-off-by: helenxie-bit <[email protected]>
* set storage_config Signed-off-by: helenxie-bit <[email protected]>
* set storage_config Signed-off-by: helenxie-bit <[email protected]>
* set storage_config Signed-off-by: helenxie-bit <[email protected]>
* use gpu Signed-off-by: helenxie-bit <[email protected]>
* use gpu Signed-off-by: helenxie-bit <[email protected]>
* use gpu Signed-off-by: helenxie-bit <[email protected]>
* fix 'set_device' error Signed-off-by: helenxie-bit <[email protected]>
* add timeout error Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* fix typo Signed-off-by: helenxie-bit <[email protected]>
* update e2e test for train api Signed-off-by: helenxie-bit <[email protected]>
* add num_labels Signed-off-by: helenxie-bit <[email protected]>
* update pip install Signed-off-by: helenxie-bit <[email protected]>
* check disk space Signed-off-by: helenxie-bit <[email protected]>
* change sequence of e2e tests Signed-off-by: helenxie-bit <[email protected]>
* add clean-up after each e2e test of pytorchjob Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function-add check disk Signed-off-by: helenxie-bit <[email protected]>
* check docker volumes Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* check docker directory Signed-off-by: helenxie-bit <[email protected]>
* update pip install and 'num_workers' Signed-off-by: helenxie-bit <[email protected]>
* update pip install and 'num_workers' Signed-off-by: helenxie-bit <[email protected]>
* update pip install Signed-off-by: helenxie-bit <[email protected]>
* change the value of 'clean_pod_policy' Signed-off-by: helenxie-bit <[email protected]>
* change the value of 'update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* check docker volumes Signed-off-by: helenxie-bit <[email protected]>
* check docker volumes Signed-off-by: helenxie-bit <[email protected]>
* stop the controller and restart it again to clean up Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* update cleanup function Signed-off-by: helenxie-bit <[email protected]>
* separate e2e test for train api Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* fix parameter of namespace Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* reduce resources Signed-off-by: helenxie-bit <[email protected]>
* separate e2e test for train API Signed-off-by: helenxie-bit <[email protected]>
* remove go setup Signed-off-by: helenxie-bit <[email protected]>
* adjust the version of k8s Signed-off-by: helenxie-bit <[email protected]>
* move test file to new place Signed-off-by: helenxie-bit <[email protected]>
* fix typos Signed-off-by: helenxie-bit <[email protected]>
* rerun tests Signed-off-by: helenxie-bit <[email protected]>
* update install packages Signed-off-by: helenxie-bit <[email protected]>
* build and verify images of storage-intializer and trainer Signed-off-by: helenxie-bit <[email protected]>
* fix image build error Signed-off-by: helenxie-bit <[email protected]>
* fix image build error Signed-off-by: helenxie-bit <[email protected]>
* check disk space Signed-off-by: helenxie-bit <[email protected]>
* make 'setup-storage-initializer-and-trainer' executable Signed-off-by: helenxie-bit <[email protected]>
* separate step of loading images Signed-off-by: helenxie-bit <[email protected]>
* check disk space after loading image Signed-off-by: helenxie-bit <[email protected]>
* clean up and check disk space Signed-off-by: helenxie-bit <[email protected]>
* prune docker build cache Signed-off-by: helenxie-bit <[email protected]>
* prune docker build cache Signed-off-by: helenxie-bit <[email protected]>
* adjust sequence of building and loading images Signed-off-by: helenxie-bit <[email protected]>
* move working directory Signed-off-by: helenxie-bit <[email protected]>
* delete moving working directory Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* use 'docker system prune' Signed-off-by: helenxie-bit <[email protected]>
* make the format of the commands to be consistent Signed-off-by: helenxie-bit <[email protected]>
* update base image Signed-off-by: helenxie-bit <[email protected]>
* update base image Signed-off-by: helenxie-bit <[email protected]>
* update base image Signed-off-by: helenxie-bit <[email protected]>
* delete unnecessary space clear and check code Signed-off-by: helenxie-bit <[email protected]>
* merge e2e test for train api into integration tests Signed-off-by: helenxie-bit <[email protected]>
* check for timeout error Signed-off-by: helenxie-bit <[email protected]>
* fix name of trainer image Signed-off-by: helenxie-bit <[email protected]>
* fix env of building storage initializer image Signed-off-by: helenxie-bit <[email protected]>
* clean format Signed-off-by: helenxie-bit <[email protected]>
* skip e2e test for train API when use scheduling Signed-off-by: helenxie-bit <[email protected]>
* Update name of fileholder Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Hezhi (Helen) Xie <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* separate e2e test for train API Signed-off-by: helenxie-bit <[email protected]>
* fix format Signed-off-by: helenxie-bit <[email protected]>
* move test script Signed-off-by: helenxie-bit <[email protected]>
* update path to test script Signed-off-by: helenxie-bit <[email protected]>
* update path to test script Signed-off-by: helenxie-bit <[email protected]>
* rerun tests Signed-off-by: helenxie-bit <[email protected]>
* rerun tests Signed-off-by: helenxie-bit <[email protected]>
* rerun tests Signed-off-by: helenxie-bit <[email protected]>
* update kubernetes version Signed-off-by: helenxie-bit <[email protected]>
* update kubernetes version Signed-off-by: helenxie-bit <[email protected]>
* rerun tests Signed-off-by: helenxie-bit <[email protected]>
* rerun tests Signed-off-by: helenxie-bit <[email protected]>
* adjust kubernetes version to 1.30.6 Signed-off-by: helenxie-bit <[email protected]>
* adjust kubernetes version to 1.31.4 Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: Hezhi (Helen) Xie <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>