> [!WARNING]
> work in progress
# running llama factory using ray on openshift ai

The customer wants to try running [llama factory](https://github.com/hiyouga/LLaMA-Factory) on OpenShift AI. We know we can run llama factory on openshift using a deployment, but that requires extra work such as image building and deployment configuration, like setting startup parameters. The most difficult part is getting the master/head node's ip address and passing it to the worker tasks' parameters, so that the workers can communicate with the master node.

In this article, we try to set up the openshift ai env and start a llama factory distributed training task involving deepspeed.

The logic diagram below illustrates the process flow for setting up and running LLaMA Factory on OpenShift AI:

![](./dia/ocp.ai.llama.factory.01.drawio.png)

We will use the following source code:
- https://github.com/wangzheng422/LLaMA-Factory/tree/wzh-stable/wzh
- [python notebook to run llama factory in python venv mode](./files/ray.notebook/2.5_run.llama.factory.ipynb)
- [python notebook to run llama factory in python direct mode (with deepspeed)](./files/ray.notebook/2.5_run.llama.factory.deepspeed.ipynb)

# run using the llama factory official way

To run llama factory using ray, we first need to understand how to run llama factory at all. So we run it the native way, to get familiar with llama factory's execution process and its requirements.

## run directly on os

We try to run everything on a rhel9 linux server. As a best practice, it is better to run everything on a vm first; once everything is ok, we can wrap it into a container later.

```bash

# first instance
# ...
quay.io/wangzheng422/qimgs:llama-factory-20241225-v01 \
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft.yaml


# 2nd instance

podman run --rm -it --pod llama-factory-pod \
# ...

```

We could not make it run on multiple nodes, because we do not have an nvidia gpu and the job failed. But this will not stop us: our target is to run the multi-node task using `ray job` to manage the distributed training effectively. The failed job does not deter us from exploring alternative solutions.
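As a preview of where we are heading, here is a minimal sketch of submitting the same training command as a `ray job`; the dashboard address is a placeholder for whatever the cluster exposes, not a real endpoint from this setup:

```python
# Minimal sketch: submit the llama factory cli run as a ray job.
# The dashboard URL below is a placeholder assumption.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://ray-head:8265")
job_id = client.submit_job(
    # same cli command as the single-node experiments, now managed by ray
    entrypoint="llamafactory-cli train wzh/tinyllama_lora_sft.yaml",
    runtime_env={"working_dir": "."},  # ship the local configs to the cluster
)
print(job_id)
```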


# run multiple instances using a ray based image

We can run llama factory the official way. Next, we want to test how to run llama factory on the ray platform, which is integrated into openshift ai by default.

Llama factory DOES support ray, per [this issue](https://github.com/hiyouga/LLaMA-Factory/issues/6543), but that support landed just 2 days before this writing. So we first try to run llama factory using the old cli; later, we will try to run it the native ray way.

## build image

First, we build the image using the ray.dockerfile, which is based on the upstream ray dockerfile and merges in the llama factory dockerfile.

```bash

cd /data/git/LLaMA-Factory

podman build -t quay.io/wangzheng422/qimgs:llama-factory-ray-20250109-v01 -f wzh/ray.dockerfile .


podman run --rm -it quay.io/wangzheng422/qimgs:llama-factory-ray-20250109-v01 /bin/bash


podman push quay.io/wangzheng422/qimgs:llama-factory-ray-20250109-v01

```

Using a venv: we also try to install the llama factory python dependencies into a separate venv.

```bash
cd /data/git/LLaMA-Factory
# ...

podman push quay.io/wangzheng422/qimgs:llama-factory-ray-20250106-v06

```
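As a rough sketch of what the separate-venv variant does (the venv path and the extras list are assumptions, not the exact dockerfile contents), the stdlib `venv` module can create the environment and install llama factory into it:

```python
# Minimal sketch: create a dedicated venv and install llama factory into it.
# The /opt/llama-venv path and the [torch,metrics] extras are assumptions.
import subprocess
import venv

env_dir = "/opt/llama-venv"
venv.create(env_dir, with_pip=True)  # create the venv with pip available

# install llama factory (editable, from the cloned repo) into that venv only
subprocess.run(
    [f"{env_dir}/bin/pip", "install", "-e", ".[torch,metrics]"],
    cwd="/data/git/LLaMA-Factory",
    check=True,
)
```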

## try to run multiple nodes

Then, we try to run multiple nodes using the ray based image, by starting multiple container instances.

```bash

# first instance
# ...
quay.io/wangzheng422/qimgs:llama-factory-ray-20241226-v01 \
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft.yaml


# 2nd instance

podman run --rm -it --pod llama-factory-pod \
# ...
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft.yaml

```

## try to run with deepspeed

Llama factory embeds deepspeed, so we try to run the training with deepspeed.

```bash

# first instance
# ...
quay.io/wangzheng422/qimgs:llama-factory-ray-20250106-v07 \
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft_dp.yaml


# 2nd instance

podman run --rm -it \
# ...
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft_dp.yaml

```

The result turns out relatively good: we can run llama factory with deepspeed using local podman containers.
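For context, llama factory wires deepspeed in through the training yaml, which points at a deepspeed config json via its `deepspeed:` key (the `_dp` suffix above suggests such a variant). A minimal sketch of generating a ZeRO stage-2 config; the file name and values here are illustrative assumptions, not the repo's actual config:

```python
# Minimal sketch: write a deepspeed ZeRO stage-2 config that a llama factory
# training yaml could reference via its `deepspeed:` key. All values below are
# illustrative assumptions, not the settings used by wzh/tinyllama_lora_sft_dp.yaml.
import json

ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 2},  # partition optimizer state and gradients
    "bf16": {"enabled": "auto"},
}

with open("ds_z2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```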

<!-- ## try basic ray
![](imgs/2024.12.llama.factory.md/2024-12-26-21-10-39.png)
![](imgs/2024.12.llama.factory.md/2024-12-26-21-10-52.png)
![](imgs/2024.12.llama.factory.md/2024-12-26-21-10-52.png) -->


# run llama factory

Now, we try to run it with openshift ai. First, create the notebook env by launching the jupyter application.

![](imgs/2024.12.llama.factory.md/2025-01-09-16-04-25.png)

Then, select the `data science` notebook image type.

![](imgs/2024.12.llama.factory.md/2025-01-09-16-04-54.png)

After entering jupyter, upload [the example notebook](./files/ray.notebook/2.5_run.llama.factory.deepspeed.ipynb) into the env.

![](imgs/2024.12.llama.factory.md/2025-01-09-16-25-58.png)

The notebook will create a ray cluster and submit a job to it. The job creates 2 actors, collects the actors' ip addresses, then calls each actor's function to start the llama factory command line and kick off the training.
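A condensed sketch of that flow; the actor and helper names are illustrative, not the notebook's exact code, and it assumes 2 workers with the llama factory image available on every node:

```python
# Condensed sketch of the notebook's flow. Names and the 2-worker layout are
# assumptions; see the actual notebook for the real code.
import os
import subprocess

import ray

ray.init()  # connect to the ray cluster created for the notebook


@ray.remote(num_cpus=1)
class TrainActor:
    def get_ip(self):
        # report this actor's node ip so rank 0 can serve as the torchrun master
        return ray.util.get_node_ip_address()

    def train(self, node_rank, master_addr, nnodes=2):
        env = dict(
            os.environ,
            FORCE_TORCHRUN="1",
            NNODES=str(nnodes),
            NODE_RANK=str(node_rank),
            MASTER_ADDR=master_addr,
            MASTER_PORT="29500",
            NPROC_PER_NODE="1",
        )
        # same cli invocation as the podman experiments above
        return subprocess.run(
            ["llamafactory-cli", "train", "wzh/tinyllama_lora_sft_dp.yaml"],
            env=env,
        ).returncode


actors = [TrainActor.remote() for _ in range(2)]
ips = ray.get([a.get_ip.remote() for a in actors])
master = ips[0]  # the first actor's node acts as the torchrun master

exit_codes = ray.get(
    [a.train.remote(rank, master) for rank, a in enumerate(actors)]
)
print(exit_codes)
```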

We can see the ray dashboard.

![](imgs/2024.12.llama.factory.md/2025-01-09-15-06-21.png)

![](imgs/2024.12.llama.factory.md/2025-01-09-15-06-42.png)

![](imgs/2024.12.llama.factory.md/2025-01-09-15-06-56.png)

![](imgs/2024.12.llama.factory.md/2025-01-09-15-07-20.png)

![](imgs/2024.12.llama.factory.md/2025-01-09-15-16-19.png)

## run llama factory in ray mode

Based on the llama factory issue above, it supports the ray framework, so we can import [this notebook](./files/ray.notebook/2.5_run.llama.factory.ray.ipynb) into jupyter to try it out.

But currently, llama factory's ray mode only supports GPU nodes, so we can not test it on our CPU nodes. The notebook itself runs, though; you can test it on a GPU node.

![](imgs/2024.12.llama.factory.md/2025-01-09-23-21-01.png)


# end