goal: ~one-click deployment for hugging face models/datasets on arbitrary cloud compute platforms
[demo.gif] # tbd
- supported providers: runpod
- low-friction workflows
  - initialize projects with one command using initialize_project.py
  - easy setup with default config.yaml and bash script that runs on your pod
  - deploy files via scp and run your scripts with one command using deploy_runpod.py
- example recipes
  - huggingface: gpt2 training with tiny-shakespeare
  - custom models: deploy models that inherit from hf architectures
  - custom projects: define your own training scripts with minimal setup
- get api key: settings > api keys (https://www.runpod.io/console/user/settings)
- create .env: add `RUNPOD_API_KEY=your_key_here`
- make ssh key: `ssh-keygen -t ed25519 -C "[email protected]"`
- get the public key: use `cat` on the path ssh-keygen tells you it saved to, and copy the key
- add key: settings > ssh keys (https://www.runpod.io/console/user/settings)
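putting the setup together, the shell steps look something like this (~/.ssh/id_ed25519.pub is just ssh-keygen's default output path; use whatever path it actually reports):

```bash
# store your runpod api key where the deploy scripts can read it
echo 'RUNPOD_API_KEY=your_key_here' > .env

# generate an ssh key pair (press enter to accept the default path)
ssh-keygen -t ed25519 -C "[email protected]"

# print the public key so you can paste it into settings > ssh keys
cat ~/.ssh/id_ed25519.pub
```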
```bash
# train any HF model on any dataset with automatic cost tracking
uv run hf_train.py --model openai-community/gpt2 --dataset karpathy/tiny_shakespeare

# or use your own model/dataset
uv run initialize_hf_project.py my_project
uv run hf_train.py --model ./projects/my_project/model.py --dataset ./projects/my_project/dataset.py
```

```bash
# initialize project
uv run initialize_project.py my_project
# update config.yaml, run_script.sh, script.py, files to scp over as needed
# check out example_cifar to see how you might clone and run a personal repository
uv run deploy_runpod.py --project my_project
```

- automatic cost tracking and budget controls
- graceful error handling and cleanup
- environment and SSH key management
- modular design for provider expansion
if you want to create a custom project, run:

```bash
uv run initialize_project.py my_project
```

this will create projects/my_project with:
- config.yaml (your deployment config)
- run_script.sh (default bash script to run on the pod)
- script.py (a sample python script)

from there, edit script.py, add files, install dependencies, etc.
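as a reference point, a minimal run_script.sh might look like the sketch below; the generated default may differ, and requirements.txt is just a stand-in for however you install dependencies:

```bash
#!/usr/bin/env bash
# run_script.sh -- runs on the pod after your project files are uploaded
# (a sketch; the default generated by initialize_project.py may differ)
set -euo pipefail

# install project dependencies (requirements.txt is a placeholder here)
pip install -r requirements.txt

# run the sample entry point
python script.py
```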
once you're ready, run:

```bash
uv run deploy_runpod.py --project=my_project
```

this will:
- look up projects/my_project/config.yaml
- create a new pod on RunPod
- upload the files in your project directory
- execute run_script.sh
- terminate the pod on completion or failed deployment, unless --keep-alive is specified
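for example, to keep the pod around for interactive debugging instead of tearing it down:

```bash
# skip automatic termination so you can ssh in and poke around
uv run deploy_runpod.py --project=my_project --keep-alive
```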
if you want to train a text generation model with huggingface:

```bash
uv run hf_train.py --model <HF_MODEL_OR_LOCAL.py> --dataset <HF_DATASET_OR_LOCAL.py> [--keep-alive]
```

example: train GPT2 on tiny_shakespeare

```bash
uv run hf_train.py --model gpt2 --dataset karpathy/tiny_shakespeare
```
key things to note:
- if you pass a local .py file for --model or --dataset, the script automatically copies it into your project and uses it.
- if you omit --keep-alive, the pod terminates after training. Otherwise, it’ll drop you into an SSH session when done.
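for example, mixing a local model definition with a hub dataset:

```bash
# local model.py + hub dataset; --keep-alive leaves the pod up for inspection
uv run hf_train.py --model ./projects/my_project/model.py --dataset karpathy/tiny_shakespeare --keep-alive
```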
if you want to initialize a project for training a custom text generation model with huggingface on your own custom dataset:

```bash
uv run initialize_hf_project.py my_hf
```

example: train mistral on tiny_shakespeare

```bash
uv run hf_train.py --model mistralai/Mistral-7B-Instruct-v0.3 --dataset karpathy/tiny_shakespeare
```
key things to note:
- if you're using a gated model (like in the example), make sure you have access and that your huggingface token is in your .env (see the example below)
- if you pass a local .py file for --model or --dataset, the script automatically copies it into your project and uses it.
- if you omit --keep-alive, the pod terminates after training. Otherwise, it’ll drop you into an SSH session when done.
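for gated models, your .env then carries both keys. HF_TOKEN is the variable huggingface_hub reads by default, but check hf_train.py for the exact name it expects:

```bash
# .env
RUNPOD_API_KEY=your_key_here
# HF_TOKEN is huggingface_hub's default; confirm the name hf_train.py expects
HF_TOKEN=hf_your_token_here
```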
there are a few sample projects in projects/:

example_cifar: clones a CIFAR10 speedrun repository, installs deps, and runs training.

```bash
uv run deploy_runpod.py --project=example_cifar
```

example_gpt2: basic GPT-2 text training script with HF + datasets; sets up training for gpt2 on the tiny shakespeare dataset.

```bash
uv run deploy_runpod.py --project=example_gpt2
```

example_hf: another example that demonstrates using hf_train.py with a local script.
## troubleshooting

```
ERROR - SCP upload failed: Read from remote host Connection reset by peer
```
runpod connections can be finicky; just try again (scp will eventually be replaced with rsync). this has been acting up more than usual recently (past 24 hours); reason unknown.
```
Pod is running but runtime information is not yet available. Waiting...
```
if you're stuck here for more than a few minutes (note: the comfyui workflow can take up to 10 minutes), check the logs on runpod.io for errors and manually terminate the pod; there may be something wrong with your custom config parameters
## roadmap
- [6/6] core features
  - [x] runpod integration
  - [x] project initialization
  - [x] file deployment
  - [x] auto-ssh after script execution
  - [x] one-command training for HF models/datasets
  - [x] one-command initialization for ComfyUI text2img workflows
- [0/6] UI/UX
  - [ ] cleaner logs (uv output in particular hogs space)
  - [ ] handle the occasional runpod connection reset during scp file transfer
  - [ ] better abstractions/code organization for continuing work (extracting templates, etc)
  - [ ] better default pod naming
  - [ ] smarter dependency management (selectively loading transformers optional dependencies like sentencepiece)
  - [ ] smoother setup wizard
- [0/5] research features
  - [ ] support for tasks beyond text generation
  - [ ] advanced training (FSDP, checkpointing)
  - [ ] spot instances + interruption handling
  - [ ] wandb integration
  - [ ] multi-GPU support
- [0/4] infrastructure
  - [ ] provider abstraction layer
  - [ ] vast.ai support
  - [ ] aws/gcp integration
  - [ ] cost optimization
- [0/2] huggingface
  - [ ] support for tasks other than text generation
  - [ ] pre + post training pipeline
- [0/2] comfyui
  - [ ] customizable dockerfile/runpod template to change bootup behavior (automatically downloaded models, etc)
  - [ ] bootup with more example workflows
- [0/3] recipes
  - [ ] notebook recipe
  - [ ] high performance recipe
  - [ ] model chat interface recipe