# cloudforge-ml

goal: ~one-click deployment for hugging face models/datasets on arbitrary cloud compute platforms

[demo.gif] # tbd

## current status

- supported providers: runpod
- low-friction workflows
  - initialize projects with one command using initialize_project.py
  - easy setup with a default config.yaml and a bash script that runs on your pod
  - deploy files via scp and run your scripts with one command using deploy_runpod.py
- example recipes
  - huggingface: gpt2 training with tiny-shakespeare
  - custom models: deploy models that inherit from hf architectures
  - custom projects: define your own training scripts with minimal setup

## quick start

### runpod/ssh setup (required)

1. get api key: settings > api keys (https://www.runpod.io/console/user/settings)
2. create .env: RUNPOD_API_KEY=your_key_here
3. make ssh key: ssh-keygen -t ed25519 -C "[email protected]"
4. copy key: cat the public key at the path ssh-keygen reports (typically ~/.ssh/id_ed25519.pub) and copy the output (see the sketch below)
5. add key: settings > ssh keys (https://www.runpod.io/console/user/settings)
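
for example, steps 2 and 4 end up looking something like this (the public key path assumes ssh-keygen's ed25519 default; adjust if you saved the key elsewhere):

```bash
# step 2: drop your runpod api key into a .env file at the repo root
echo "RUNPOD_API_KEY=your_key_here" > .env

# step 4: print the public key so you can paste it into settings > ssh keys
cat ~/.ssh/id_ed25519.pub
```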

### standard workflow

```bash
# train any HF model on any dataset with automatic cost tracking
uv run hf_train.py --model openai-community/gpt2 --dataset karpathy/tiny_shakespeare

# or use your own model/dataset
uv run initialize_hf_project.py my_project
uv run hf_train.py --model ./projects/my_project/model.py --dataset ./projects/my_project/dataset.py
```

### custom workflow

```bash
# initialize project
uv run initialize_project.py my_project

# update config.yaml, run_script.sh, script.py, files to scp over as needed
# check out example_cifar to see how you might clone and run a personal repository
uv run deploy_runpod.py --project my_project
```
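
for orientation, run_script.sh is just the bash script the pod executes after your files are uploaded. a minimal sketch might look like the following; the actual default is generated by initialize_project.py, and requirements.txt is a hypothetical file you would add yourself:

```bash
#!/bin/bash
# hypothetical run_script.sh: install deps, then run the sample script created by initialize_project.py
set -e
pip install -r requirements.txt   # assumption: swap in whatever setup your project actually needs
python script.py
```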

## features

- automatic cost tracking and budget controls
- graceful error handling and cleanup
- environment and SSH key management
- modular design for provider expansion

## detailed usage info

### deploying a custom project to RunPod

if you want to create a custom project, run:

```bash
uv run initialize_project.py my_project
```

this will create projects/my_project with:

- config.yaml (your deployment config)
- run_script.sh (default bash script to run on the pod)
- script.py (a sample python script)

from there, edit script.py, add files, install dependencies, etc.

once you're ready, run:

```bash
uv run deploy_runpod.py --project=my_project
```

this will:

- look up projects/my_project/config.yaml
- create a new pod on RunPod
- upload the files in your project directory
- execute run_script.sh
- terminate the pod on completion or failed deployment, unless --keep-alive is specified (see the example below)
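
for example, to leave the pod up for inspection instead of tearing it down when run_script.sh finishes:

```bash
# keep the pod alive after the run instead of terminating it
uv run deploy_runpod.py --project=my_project --keep-alive
```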

### training a huggingface model

if you want to train a text generation model with huggingface, run:

```bash
uv run hf_train.py --model <HF_MODEL_OR_LOCAL.py> --dataset <HF_DATASET_OR_LOCAL.py> [--keep-alive]
```

example: train GPT2 on tiny_shakespeare:

```bash
uv run hf_train.py --model gpt2 --dataset karpathy/tiny_shakespeare
```

key things to note:

- if you pass a local .py file for --model or --dataset, the script automatically copies it into your project and uses it (see the example below).
- if you omit --keep-alive, the pod terminates after training. Otherwise, it'll drop you into an SSH session when done.
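
for instance, combining a local model definition with a hub dataset (the model.py path is the one created by the standard workflow above):

```bash
# local --model file + hub dataset; --keep-alive keeps the pod and drops you into ssh when training finishes
uv run hf_train.py --model ./projects/my_project/model.py --dataset karpathy/tiny_shakespeare --keep-alive
```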

### initializing a custom huggingface run

if you want to train a custom text generation model with huggingface on your own custom dataset, initialize a project with:

```bash
uv run initialize_hf_project.py my_hf
```

example: train mistral on tiny_shakespeare:

```bash
uv run hf_train.py --model mistralai/Mistral-7B-Instruct-v0.3 --dataset karpathy/tiny_shakespeare
```

key things to note:

- if you're using a gated model (like in the example), be sure you have access and that your huggingface token is in your .env (see the sketch below)
- if you pass a local .py file for --model or --dataset, the script automatically copies it into your project and uses it.
- if you omit --keep-alive, the pod terminates after training. Otherwise, it'll drop you into an SSH session when done.
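
a .env sketch for the gated-model case might look like this; RUNPOD_API_KEY comes from the setup section, and the huggingface variable name here is an assumption (HF_TOKEN is the standard huggingface_hub name, so check what this repo actually reads):

```bash
# hypothetical .env for a gated-model run
RUNPOD_API_KEY=your_runpod_key_here
HF_TOKEN=your_huggingface_token_here   # assumption: the expected variable name may differ
```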

### examples to help you get started

there are a few sample projects in projects/:

example_cifar: clones a CIFAR10 speedrun repository, installs deps, and runs training.

```bash
uv run deploy_runpod.py --project=example_cifar
```

example_gpt2: basic GPT-2 training script on the tiny shakespeare dataset using HF transformers + datasets.

```bash
uv run deploy_runpod.py --project=example_gpt2
```

example_hf: another example that demonstrates using hf_train.py with a local script.

## common errors

```bash
ERROR - SCP upload failed: Read from remote host Connection reset by peer
```

runpod connections can be finicky; just try again (scp will eventually be replaced with rsync). this has been acting up more recently (past 24 hours), reason unknown.
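
since the fix is literally re-running the command, a crude retry wrapper works as a stopgap (a sketch, not a built-in feature; per the behavior above, a failed deployment terminates its pod, so retries shouldn't leave orphans):

```bash
# retry the deploy a few times if scp gets reset mid-upload
for attempt in 1 2 3; do
  uv run deploy_runpod.py --project=my_project && break
  echo "deploy attempt ${attempt} failed, retrying..."
done
```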


```bash
Pod is running but runtime information is not yet available. Waiting...
```

if you get stuck here for more than a few minutes (note: the comfyui workflow can take up to 10 minutes), check the logs on runpod.io for any errors and manually terminate the pod - there may be something wrong with your custom config parameters.


## roadmap
- [6/6] core features
  - [x] runpod integration
  - [x] project initialization
  - [x] file deployment
  - [x] auto-ssh after script execution
  - [x] one-command training for HF models/datasets
  - [x] one-command initialization for ComfyUI text2img workflows

- [0/6] UI/UX
  - [ ] cleaner logs (uv especially really hogs space in the logs)
  - [ ] handle runpod randomly resetting the connection during scp file transfer (every once in a blue moon)
  - [ ] better abstractions/code organization for continuing work (extracting templates, etc)
  - [ ] better default pod naming
  - [ ] smarter dependency management (selectively loading transformers optional dependencies like sentencepiece)
  - [ ] smoother setup wizard

- [0/5] research features
  - [ ] support for tasks beyond text generation
  - [ ] advanced training (FSDP, checkpointing)
  - [ ] spot instances + interruption handling
  - [ ] wandb integration
  - [ ] multi-GPU support

- [0/4] infrastructure
  - [ ] provider abstraction layer
  - [ ] vast.ai support
  - [ ] aws/gcp integration
  - [ ] cost optimization

- [0/2] huggingface
  - [ ] support for tasks other than text generation
  - [ ] pre + post training pipeline

- [0/2] comfyui
  - [ ] customizable dockerfile/runpod template to change bootup behavior (automatically downloaded models, etc)
  - [ ] bootup with more example workflows

- [0/3] recipes
  - [ ] notebook recipe
  - [ ] high performance recipe
  - [ ] model chat interface recipe
