# cloudforge-ml

goal: ~one-click deployment for hugging face models/datasets on arbitrary cloud compute platforms

[demo.gif] # tbd

## current status

- supported providers: runpod
- low-friction workflows
  - initialize projects with one command using initialize_project.py
  - easy setup with a default config.yaml and a bash script that runs on your pod
  - deploy files via scp and run your scripts with one command using deploy_runpod.py
- example recipes
  - huggingface: gpt2 training with tiny-shakespeare
  - custom models: deploy models that inherit from hf architectures
  - custom projects: define your own training scripts with minimal setup

## quick start

### runpod/ssh setup (required)

1. get api key: settings > api keys (https://www.runpod.io/console/user/settings)
2. create .env: RUNPOD_API_KEY=your_key_here
3. make ssh key: ssh-keygen -t ed25519 -C "[email protected]"
4. copy key: cat the public key at the path ssh-keygen reports (typically ~/.ssh/id_ed25519.pub) and copy the output (see the sketch below)
5. add key: settings > ssh keys (https://www.runpod.io/console/user/settings)
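
for example, steps 2 and 4 end up looking something like this (the public key path assumes ssh-keygen's ed25519 default; adjust if you saved the key elsewhere):

```bash
# step 2: drop your runpod api key into a .env file at the repo root
echo "RUNPOD_API_KEY=your_key_here" > .env

# step 4: print the public key so you can paste it into settings > ssh keys
cat ~/.ssh/id_ed25519.pub
```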

### standard workflow

```bash
# train any HF model on any dataset with automatic cost tracking
uv run hf_train.py --model openai-community/gpt2 --dataset karpathy/tiny_shakespeare

# or use your own model/dataset
uv run initialize_hf_project.py my_project
uv run hf_train.py --model ./projects/my_project/model.py --dataset ./projects/my_project/dataset.py
```

### custom workflow

```bash
# initialize project
uv run initialize_project.py my_project

# update config.yaml, run_script.sh, script.py, files to scp over as needed
# check out example_cifar to see how you might clone and run a personal repository
uv run deploy_runpod.py --project my_project
```
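
for orientation, run_script.sh is just the bash script the pod executes after your files are uploaded. a minimal sketch might look like the following; the actual default is generated by initialize_project.py, and requirements.txt is a hypothetical file you would add yourself:

```bash
#!/bin/bash
# hypothetical run_script.sh: install deps, then run the sample script created by initialize_project.py
set -e
pip install -r requirements.txt   # assumption: swap in whatever setup your project actually needs
python script.py
```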

## features

- automatic cost tracking and budget controls
- graceful error handling and cleanup
- environment and SSH key management
- modular design for provider expansion

## detailed usage info

### deploying a custom project to RunPod

if you want to create a custom project, run:

```bash
uv run initialize_project.py my_project
```

this will create projects/my_project with:

- config.yaml (your deployment config)
- run_script.sh (default bash script to run on the pod)
- script.py (a sample python script)

from there, edit script.py, add files, install dependencies, etc.

once you're ready, run:

```bash
uv run deploy_runpod.py --project=my_project
```

this will:

- look up projects/my_project/config.yaml
- create a new pod on RunPod
- upload the files in your project directory
- execute run_script.sh
- terminate the pod on completion or failed deployment, unless --keep-alive is specified (see the example below)
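
for example, to leave the pod up for inspection instead of tearing it down when run_script.sh finishes:

```bash
# keep the pod alive after the run instead of terminating it
uv run deploy_runpod.py --project=my_project --keep-alive
```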

### training a huggingface model

if you want to train a text generation model with huggingface, run:

```bash
uv run hf_train.py --model <HF_MODEL_OR_LOCAL.py> --dataset <HF_DATASET_OR_LOCAL.py> [--keep-alive]
```

example: train GPT2 on tiny_shakespeare:

```bash
uv run hf_train.py --model gpt2 --dataset karpathy/tiny_shakespeare
```

key things to note:

- if you pass a local .py file for --model or --dataset, the script automatically copies it into your project and uses it (see the example below).
- if you omit --keep-alive, the pod terminates after training. Otherwise, it'll drop you into an SSH session when done.
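
for instance, combining a local model definition with a hub dataset (the model.py path is the one created by the standard workflow above):

```bash
# local --model file + hub dataset; --keep-alive keeps the pod and drops you into ssh when training finishes
uv run hf_train.py --model ./projects/my_project/model.py --dataset karpathy/tiny_shakespeare --keep-alive
```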

### initializing a custom huggingface run

if you want to train a custom text generation model with huggingface on your own custom dataset, initialize a project with:

```bash
uv run initialize_hf_project.py my_hf
```

example: train mistral on tiny_shakespeare:

```bash
uv run hf_train.py --model mistralai/Mistral-7B-Instruct-v0.3 --dataset karpathy/tiny_shakespeare
```

key things to note:

- if you're using a gated model (like in the example), be sure you have access and that your huggingface token is in your .env (see the sketch below)
- if you pass a local .py file for --model or --dataset, the script automatically copies it into your project and uses it.
- if you omit --keep-alive, the pod terminates after training. Otherwise, it'll drop you into an SSH session when done.
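
a .env sketch for the gated-model case might look like this; RUNPOD_API_KEY comes from the setup section, and the huggingface variable name here is an assumption (HF_TOKEN is the standard huggingface_hub name, so check what this repo actually reads):

```bash
# hypothetical .env for a gated-model run
RUNPOD_API_KEY=your_runpod_key_here
HF_TOKEN=your_huggingface_token_here   # assumption: the expected variable name may differ
```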

### examples to help you get started

there are a few sample projects in projects/:

example_cifar: clones a CIFAR10 speedrun repository, installs deps, and runs training.

```bash
uv run deploy_runpod.py --project=example_cifar
```

example_gpt2: basic GPT-2 training script on the tiny shakespeare dataset using HF transformers + datasets.

```bash
uv run deploy_runpod.py --project=example_gpt2
```

example_hf: another example that demonstrates using hf_train.py with a local script.

## common errors

```bash
ERROR - SCP upload failed: Read from remote host Connection reset by peer
```

runpod connections can be finicky; just try again (scp will eventually be replaced with rsync). this has been acting up more recently (past 24 hours), reason unknown.
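
since the fix is literally re-running the command, a crude retry wrapper works as a stopgap (a sketch, not a built-in feature; per the behavior above, a failed deployment terminates its pod, so retries shouldn't leave orphans):

```bash
# retry the deploy a few times if scp gets reset mid-upload
for attempt in 1 2 3; do
  uv run deploy_runpod.py --project=my_project && break
  echo "deploy attempt ${attempt} failed, retrying..."
done
```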


```bash
Pod is running but runtime information is not yet available. Waiting...
```

if you get stuck here for more than a few minutes (note: the comfyui workflow can take up to 10 minutes), check the logs on runpod.io for any errors and manually terminate the pod - there may be something wrong with your custom config parameters.


## roadmap
- [6/6] core features
  - [x] runpod integration
  - [x] project initialization
  - [x] file deployment
  - [x] auto-ssh after script execution
  - [x] one-command training for HF models/datasets
  - [x] one-command initialization for ComfyUI text2img workflows

- [0/6] UI/UX
  - [ ] cleaner logs (uv especially really hogs space in the logs)
  - [ ] handle runpod randomly resetting the connection during scp file transfer (every once in a blue moon)
  - [ ] better abstractions/code organization for continuing work (extracting templates, etc)
  - [ ] better default pod naming
  - [ ] smarter dependency management (selectively loading transformers optional dependencies like sentencepiece)
  - [ ] smoother setup wizard

- [0/5] research features
  - [ ] support for tasks beyond text generation
  - [ ] advanced training (FSDP, checkpointing)
  - [ ] spot instances + interruption handling
  - [ ] wandb integration
  - [ ] multi-GPU support

- [0/4] infrastructure
  - [ ] provider abstraction layer
  - [ ] vast.ai support
  - [ ] aws/gcp integration
  - [ ] cost optimization

- [0/2] huggingface
  - [ ] support for tasks other than text generation
  - [ ] pre + post training pipeline

- [0/2] comfyui
  - [ ] customizable dockerfile/runpod template to change bootup behavior (automatically downloaded models, etc)
  - [ ] bootup with more example workflows

- [0/3] recipes
  - [ ] notebook recipe
  - [ ] high performance recipe
  - [ ] model chat interface recipe
