> [!WARNING]
> work in progress
# running llama factory using ray on openshift ai

The customer wants to try running [llama factory](https://github.com/hiyouga/LLaMA-Factory) on OpenShift AI. We know we can run llama factory on openshift using a deployment, but that requires extra work such as image building and deployment configuration, like setting startup parameters. The most difficult part is getting the master/head node's ip address and passing it to the worker tasks' parameters, so that the workers can communicate with the master node.

In this article, we try to set up the openshift ai env and start a llama factory distributed training task involving deepspeed.

The logic diagram below illustrates the process flow for setting up and running LLaMA Factory on OpenShift AI:

![](./dia/ocp.ai.llama.factory.01.drawio.png)

We will use the following source code:
- https://github.com/wangzheng422/LLaMA-Factory/tree/wzh-stable/wzh
- [python notebook to run llama factory in python venv mode](./files/ray.notebook/2.5_run.llama.factory.ipynb)
- [python notebook to run llama factory in python direct mode (with deepspeed)](./files/ray.notebook/2.5_run.llama.factory.deepspeed.ipynb)

# run using the llama factory official way

To run llama factory using ray, we first need to understand how to run llama factory at all. So we run it the native way, to get familiar with llama factory's execution process and its requirements.

## run directly on os

We try to run everything on a rhel9 linux server. As a best practice, it is better to run everything on a vm first; once everything is ok, we can wrap it into a container later.

```bash

# first instance
# ...
quay.io/wangzheng422/qimgs:llama-factory-20241225-v01 \
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft.yaml


# 2nd instance

podman run --rm -it --pod llama-factory-pod \
# ...

```

We could not make it run on multiple nodes, because we do not have an nvidia gpu and the job failed. But this will not stop us: our target is to run the multi-node task using `ray job` to manage the distributed training effectively. The failed job does not deter us from exploring alternative solutions.
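As a preview of where we are heading, here is a minimal sketch of submitting the same training command as a `ray job`; the dashboard address is a placeholder for whatever the cluster exposes, not a real endpoint from this setup:

```python
# Minimal sketch: submit the llama factory cli run as a ray job.
# The dashboard URL below is a placeholder assumption.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://ray-head:8265")
job_id = client.submit_job(
    # same cli command as the single-node experiments, now managed by ray
    entrypoint="llamafactory-cli train wzh/tinyllama_lora_sft.yaml",
    runtime_env={"working_dir": "."},  # ship the local configs to the cluster
)
print(job_id)
```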


# run multiple instances using a ray based image

We can run llama factory the official way. Next, we want to test how to run llama factory on the ray platform, which is integrated into openshift ai by default.

Llama factory DOES support ray, per [this issue](https://github.com/hiyouga/LLaMA-Factory/issues/6543), but that support landed just 2 days before this writing. So we first try to run llama factory using the old cli; later, we will try to run it the native ray way.

## build image

First, we build the image using the ray.dockerfile, which is based on the upstream ray dockerfile and merges in the llama factory dockerfile.

```bash

cd /data/git/LLaMA-Factory

podman build -t quay.io/wangzheng422/qimgs:llama-factory-ray-20250109-v01 -f wzh/ray.dockerfile .


podman run --rm -it quay.io/wangzheng422/qimgs:llama-factory-ray-20250109-v01 /bin/bash


podman push quay.io/wangzheng422/qimgs:llama-factory-ray-20250109-v01

```

Using a venv: we also try to install the llama factory python dependencies into a separate venv.

```bash
cd /data/git/LLaMA-Factory
# ...

podman push quay.io/wangzheng422/qimgs:llama-factory-ray-20250106-v06

```
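As a rough sketch of what the separate-venv variant does (the venv path and the extras list are assumptions, not the exact dockerfile contents), the stdlib `venv` module can create the environment and install llama factory into it:

```python
# Minimal sketch: create a dedicated venv and install llama factory into it.
# The /opt/llama-venv path and the [torch,metrics] extras are assumptions.
import subprocess
import venv

env_dir = "/opt/llama-venv"
venv.create(env_dir, with_pip=True)  # create the venv with pip available

# install llama factory (editable, from the cloned repo) into that venv only
subprocess.run(
    [f"{env_dir}/bin/pip", "install", "-e", ".[torch,metrics]"],
    cwd="/data/git/LLaMA-Factory",
    check=True,
)
```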

## try to run multiple nodes

Then, we try to run multiple nodes using the ray based image, by starting multiple container instances.

```bash

# first instance
# ...
quay.io/wangzheng422/qimgs:llama-factory-ray-20241226-v01 \
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft.yaml


# 2nd instance

podman run --rm -it --pod llama-factory-pod \
# ...
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft.yaml

```

## try to run with deepspeed

Llama factory embeds deepspeed, so we try to run the training with deepspeed.

```bash

# first instance
# ...
quay.io/wangzheng422/qimgs:llama-factory-ray-20250106-v07 \
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft_dp.yaml


# 2nd instance

podman run --rm -it \
# ...
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=10.5.0.3 MASTER_PORT=29500 NPROC_PER_NODE=1 llamafactory-cli train wzh/tinyllama_lora_sft_dp.yaml

```

The result turns out relatively good: we can run llama factory with deepspeed using local podman containers.
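For context, llama factory wires deepspeed in through the training yaml, which points at a deepspeed config json via its `deepspeed:` key (the `_dp` suffix above suggests such a variant). A minimal sketch of generating a ZeRO stage-2 config; the file name and values here are illustrative assumptions, not the repo's actual config:

```python
# Minimal sketch: write a deepspeed ZeRO stage-2 config that a llama factory
# training yaml could reference via its `deepspeed:` key. All values below are
# illustrative assumptions, not the settings used by wzh/tinyllama_lora_sft_dp.yaml.
import json

ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 2},  # partition optimizer state and gradients
    "bf16": {"enabled": "auto"},
}

with open("ds_z2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```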

<!-- ## try basic ray
![](imgs/2024.12.llama.factory.md/2024-12-26-21-10-39.png)
![](imgs/2024.12.llama.factory.md/2024-12-26-21-10-52.png)
![](imgs/2024.12.llama.factory.md/2024-12-26-21-10-52.png) -->


# run llama factory

Now, we try to run it with openshift ai. First, create the notebook env by launching the jupyter application.

![](imgs/2024.12.llama.factory.md/2025-01-09-16-04-25.png)

Then, select the `data science` notebook image type.

![](imgs/2024.12.llama.factory.md/2025-01-09-16-04-54.png)

After entering jupyter, upload [the example notebook](./files/ray.notebook/2.5_run.llama.factory.deepspeed.ipynb) into the env.

![](imgs/2024.12.llama.factory.md/2025-01-09-16-25-58.png)

The notebook will create a ray cluster and submit a job to it. The job creates 2 actors, collects the actors' ip addresses, then calls each actor's function to start the llama factory command line and kick off the training.
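A condensed sketch of that flow; the actor and helper names are illustrative, not the notebook's exact code, and it assumes 2 workers with the llama factory image available on every node:

```python
# Condensed sketch of the notebook's flow. Names and the 2-worker layout are
# assumptions; see the actual notebook for the real code.
import os
import subprocess

import ray

ray.init()  # connect to the ray cluster created for the notebook


@ray.remote(num_cpus=1)
class TrainActor:
    def get_ip(self):
        # report this actor's node ip so rank 0 can serve as the torchrun master
        return ray.util.get_node_ip_address()

    def train(self, node_rank, master_addr, nnodes=2):
        env = dict(
            os.environ,
            FORCE_TORCHRUN="1",
            NNODES=str(nnodes),
            NODE_RANK=str(node_rank),
            MASTER_ADDR=master_addr,
            MASTER_PORT="29500",
            NPROC_PER_NODE="1",
        )
        # same cli invocation as the podman experiments above
        return subprocess.run(
            ["llamafactory-cli", "train", "wzh/tinyllama_lora_sft_dp.yaml"],
            env=env,
        ).returncode


actors = [TrainActor.remote() for _ in range(2)]
ips = ray.get([a.get_ip.remote() for a in actors])
master = ips[0]  # the first actor's node acts as the torchrun master

exit_codes = ray.get(
    [a.train.remote(rank, master) for rank, a in enumerate(actors)]
)
print(exit_codes)
```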

We can see the ray dashboard.

![](imgs/2024.12.llama.factory.md/2025-01-09-15-06-21.png)

![](imgs/2024.12.llama.factory.md/2025-01-09-15-06-42.png)

![](imgs/2024.12.llama.factory.md/2025-01-09-15-06-56.png)

![](imgs/2024.12.llama.factory.md/2025-01-09-15-07-20.png)

![](imgs/2024.12.llama.factory.md/2025-01-09-15-16-19.png)

## run llama factory in ray mode

Based on the llama factory issue above, it supports the ray framework, so we can import [this notebook](./files/ray.notebook/2.5_run.llama.factory.ray.ipynb) into jupyter to try it out.

But currently, llama factory's ray mode only supports GPU nodes, so we can not test it on our CPU nodes. The notebook itself runs, though; you can test it on a GPU node.

![](imgs/2024.12.llama.factory.md/2025-01-09-23-21-01.png)


# end