Notice: This repository is a modified version of @Kyyle2114/Pytorch-Template, enhanced to support training with Hugging Face's Accelerate and improved with comprehensive type hints, error handling, and best practices.

A structured and modular template for deep learning model training with Hugging Face's Accelerate.

## Features
- Distributed Training: Full support via Hugging Face Accelerate
- Type Safety: Comprehensive type hints throughout the codebase
- Error Handling: Robust exception handling with specific error types
- Modular Design: Clean separation of concerns with well-defined interfaces
- Configuration Management: YAML-based configuration with Pydantic validation
- Weights & Biases Integration: Automatic experiment tracking and visualization
- Early Stopping: Configurable early stopping to prevent overfitting
- Model Checkpointing: Automatic saving of best and last models
- Progress Tracking: Detailed logging of training progress and metrics
- PEP 8 Compliance: Consistent code formatting with Ruff
- Documentation: Google-style docstrings for all functions and classes
- Input Validation: Pydantic-based config validation with informative error messages
- Memory Management: Efficient memory usage with proper cleanup
- Learning Rate Scheduling: Cosine annealing with warm-up restarts
- Gradient Accumulation: Support for effective batch size scaling
- Mixed Precision: FP16 training for improved performance
- Gradient Clipping: Configurable gradient norm clipping
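As an illustration of the scheduling feature above, a warm-up-then-cosine learning-rate function can be sketched as follows (illustrative only; the actual implementation in `utils/lr_sched.py` may differ, e.g. in how it handles restarts):

```python
import math


def lr_at_epoch(epoch: float, base_lr: float = 1e-3,
                warmup_epochs: int = 10, total_epochs: int = 100,
                min_lr: float = 0.0) -> float:
    """Linear warm-up to base_lr, then cosine decay toward min_lr."""
    if epoch < warmup_epochs:
        # Ramp linearly from 0 to base_lr over the warm-up epochs
        return base_lr * epoch / warmup_epochs
    # Cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The defaults here match the template's `lr` and `warmup_epochs` config values; the schedule peaks at `base_lr` right after warm-up and decays to `min_lr` at the final epoch.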
## Requirements

- Python 3.10 or higher
- CUDA-compatible GPU (optional but recommended)
- Git
## Installation

1. Clone the repository:

```bash
git clone https://github.com/your-username/Pytorch-Template-Accelerate.git
cd Pytorch-Template-Accelerate
```

2. Create a virtual environment:

```bash
# Using conda
conda create -n pytorch-template python=3.10
conda activate pytorch-template
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

4. Configure Accelerate (first time only):

```bash
accelerate config
```

## Quick Start

```bash
chmod +x train.sh
bash train.sh
```

Or invoke `main.py` directly:

```bash
# Single GPU
python main.py --config config/default.yaml

# Multi-GPU with Accelerate
accelerate launch main.py --config config/default.yaml

# With config overrides
python main.py --config config/default.yaml --set training.lr=0.0001 --set data.batch_size=32
```

## Project Structure

```
Pytorch-Template-Accelerate/
├── config/
│   ├── __init__.py        # Config package initialization
│   ├── default.yaml       # Default configuration file
│   └── schemas.py         # Pydantic configuration schemas
├── engines/
│   ├── __init__.py        # Engine package initialization
│   └── engine_train.py    # Training and evaluation logic
├── models/
│   ├── __init__.py        # Models package initialization
│   └── cnn.py             # CNN model implementation
├── utils/
│   ├── __init__.py        # Utils package initialization
│   ├── datasets.py        # Dataset handling utilities
│   ├── lr_sched.py        # Learning rate schedulers
│   └── misc.py            # Miscellaneous utilities
├── dataset/               # Dataset storage
├── output_dir/            # Training outputs (auto-created)
├── main.py                # Main training script
├── train.sh               # Training execution script
├── requirements.txt       # Python dependencies
├── .gitignore             # Git ignore rules
└── README.md              # This file
```
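As an illustration of the early-stopping feature (`training.patience` in the config), a minimal patience counter might look like this; the template's actual implementation may differ in details such as the improvement threshold:

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 50, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss  # improvement: remember it and reset the counter
            self.counter = 0
        else:
            self.counter += 1          # no improvement this epoch
        return self.counter >= self.patience
```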
## Configuration

Configuration is managed through YAML files with Pydantic validation. The default configuration file is `config/default.yaml`:
```yaml
# --- General Configuration ---
general:
  seed: 42
  output_dir: ./output_dir

# --- Data Configuration ---
data:
  dataset_path: ./dataset
  batch_size: 16
  num_workers: 4

# --- Training Configuration ---
training:
  epoch: 100
  patience: 50
  lr: 1e-3
  weight_decay: 1e-4
  grad_accum_steps: 1
  warmup_epochs: 10
  clip_grad: null  # null for no gradient clipping

# --- WandB Configuration ---
wandb:
  project_name: Model-Training
  run_name: Model-Training
```

### Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| `-c`, `--config` | str | `config/default.yaml` | Path to YAML configuration file |
| `--help_config` | flag | - | Show detailed help for configuration parameters |
| `--set` | KEY=VALUE | - | Override any config value (can be used multiple times) |
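A plausible sketch of how such dotted `--set KEY=VALUE` overrides can be applied to a nested config dict (illustrative only; `main.py`'s actual parsing may differ, e.g. in how values are typed):

```python
import ast


def apply_override(config: dict, assignment: str) -> None:
    """Apply a single 'section.key=value' override to a nested config dict in place."""
    key, _, raw_value = assignment.partition("=")
    try:
        # Turn "32" into 32, "0.0001" into a float, "null" stays a string, etc.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value  # fall back to the raw string
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


config = {"training": {"lr": 1e-3}, "data": {"batch_size": 16}}
apply_override(config, "training.lr=0.0001")
apply_override(config, "data.batch_size=32")
```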
```bash
# Override single value
python main.py --config config/default.yaml --set training.lr=0.0001

# Override multiple values
python main.py --config config/default.yaml \
    --set general.seed=42 \
    --set data.batch_size=32 \
    --set training.epoch=50

# Show config help
python main.py --help_config
```

### Configuration Parameters

| Section | Parameter | Type | Default | Description |
|---|---|---|---|---|
| general | seed | int | 42 | Random seed for reproducibility |
| general | output_dir | str | ./output_dir | Output directory for checkpoints and logs |
| data | dataset_path | str | ./dataset | Dataset root path |
| data | batch_size | int | 16 | Batch size per GPU |
| data | num_workers | int | 4 | Number of data loading workers |
| training | epoch | int | 100 | Total number of training epochs |
| training | patience | int | 50 | Early stopping patience |
| training | lr | float | 1e-3 | Base learning rate |
| training | weight_decay | float | 1e-4 | Weight decay for optimizer |
| training | grad_accum_steps | int | 1 | Gradient accumulation steps |
| training | warmup_epochs | int | 10 | Number of warmup epochs |
| training | clip_grad | float/null | null | Gradient clipping norm (null for no clipping) |
| wandb | project_name | str | Model-Training | WandB project name |
| wandb | run_name | str | Model-Training | WandB run name |
## Customization

### Adding Your Own Model

1. Create your model in `models/your_model.py`:

```python
import torch
import torch.nn as nn


class YourModel(nn.Module):
    def __init__(self, num_classes: int = 10) -> None:
        super().__init__()
        # Your model implementation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass implementation
        return x
```

2. Update `models/__init__.py`:

```python
from .your_model import YourModel

__all__ = ['SimpleCNNforCIFAR10', 'YourModel']
```

3. Modify `main.py` to use your model:

```python
model = models.YourModel(num_classes=10)
```

### Adding Your Own Dataset

1. Implement your dataset in `utils/datasets.py`:

```python
from typing import Optional

import torch
from torch.utils.data import Dataset


class YourDataset(Dataset):
    def __init__(self, data_path: str, transform: Optional[torch.nn.Module] = None) -> None:
        # Your dataset implementation
        pass
```

2. Create a factory function:

```python
def make_your_dataset(dataset_path: str, train: bool = True, transform=None):
    return YourDataset(dataset_path, transform)
```

### Using a Custom Configuration

1. Create a new YAML file (e.g., `config/my_config.yaml`):
```yaml
general:
  seed: 123
  output_dir: ./my_experiment

data:
  dataset_path: /path/to/your/data
  batch_size: 64
  num_workers: 8

training:
  epoch: 200
  patience: 30
  lr: 5e-4
  weight_decay: 1e-5
  grad_accum_steps: 2
  warmup_epochs: 5
  clip_grad: 1.0

wandb:
  project_name: My-Project
  run_name: experiment-1
```

2. Run training with your config:

```bash
python main.py --config config/my_config.yaml
```

## Troubleshooting

1. **CUDA Out of Memory**
   - Reduce `data.batch_size` in your config
   - Increase `training.grad_accum_steps` to maintain the effective batch size

2. **WandB Login Issues**
   - Set the `WANDB_API_KEY` environment variable
   - Use the `wandb login` command
   - Set `WANDB_MODE=offline` for offline logging

3. **Import Errors**
   - Ensure all dependencies are installed: `pip install -r requirements.txt`
   - Check the Python version: requires 3.10+

4. **Multi-GPU Training Issues**
   - Run `accelerate config` to set up distributed training
   - Ensure `CUDA_VISIBLE_DEVICES` is set correctly in `train.sh`

5. **Config Validation Errors**
   - Check the YAML syntax in your config file
   - Ensure all required fields are present
   - Use `--help_config` to see available parameters
## Monitoring

Training metrics automatically logged to WandB include:

- Training/validation loss
- Learning rate schedule
- Model accuracy
- Best model metrics (epoch, loss, accuracy)
- Hardware utilization
- Hyperparameter configurations

## Outputs

- JSON logs: `output_dir/log.txt`
- Best model: `output_dir/best_model/`
- Last model: `output_dir/last_model/`
- Configuration: `output_dir/config.yaml`
## Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes with proper type hints and documentation
4. Run tests and ensure code quality
5. Commit your changes: `git commit -m 'Add amazing feature'`
6. Push to the branch: `git push origin feature/amazing-feature`
7. Open a Pull Request
## References

- Hugging Face Accelerate Documentation
- PyTorch Distributed Training
- Weights & Biases
- Pydantic Documentation
- Original Template by @Kyyle2114
⭐ Star this repository if you find it helpful!