diff --git a/skills/hugging-face-model-trainer/SKILL.md b/skills/hugging-face-model-trainer/SKILL.md
index f8affed..22a1144 100644
--- a/skills/hugging-face-model-trainer/SKILL.md
+++ b/skills/hugging-face-model-trainer/SKILL.md
@@ -672,6 +672,7 @@ Add to PEP 723 header:
- `references/hardware_guide.md` - Hardware specs and selection
- `references/hub_saving.md` - Hub authentication troubleshooting
- `references/troubleshooting.md` - Common issues and solutions
+- `references/local_training_macos.md` - Local training on macOS
### Scripts (In This Skill)
- `scripts/train_sft_example.py` - Production SFT template
diff --git a/skills/hugging-face-model-trainer/references/local_training_macos.md b/skills/hugging-face-model-trainer/references/local_training_macos.md
new file mode 100644
index 0000000..400c6f5
--- /dev/null
+++ b/skills/hugging-face-model-trainer/references/local_training_macos.md
@@ -0,0 +1,603 @@
+# Local training on macOS (Apple Silicon)
+
+This reference explains how to run **small fine-tuning jobs locally on a Mac** (best for smoke tests and quick iteration).
+
+> **Clarification:** This is not about "training on a phone." The workflow is: **train locally or in the cloud → export/quantize → run inference on-device**.
+
+For general issues, see [troubleshooting.md](troubleshooting.md). This doc focuses on **macOS-specific** setup and issues.
+
+## Quickstart (TL;DR)
+
+```bash
+# 1. Setup (see detailed steps in Setup section below)
+xcode-select --install
+python3 -m venv .venv && source .venv/bin/activate
+python -m pip install -U "torch>=2.2" "transformers>=4.40" "trl>=0.12" "peft>=0.10" \
+ datasets accelerate safetensors huggingface_hub
+
+# 2. Run smoke test (see full script below)
+python train_lora_sft.py
+
+# 3. Verify
+ls outputs/local-lora/ # should contain adapter weights (*.safetensors) + adapter_config.json
+```
+
+> **Note:** If you hit `TypeError` or version conflicts, see [troubleshooting.md](troubleshooting.md) for pinning guidance.
+
+### Recommended requirements.txt (for reproducibility)
+
+```txt
+torch>=2.2,<3.0
+transformers>=4.40,<5.0
+trl>=0.12,<1.0
+peft>=0.10,<1.0
+datasets>=2.18,<3.0
+accelerate>=0.28,<1.0
+safetensors>=0.4,<1.0
+huggingface_hub>=0.21,<1.0
+```
+
+Install with:
+```bash
+pip install -r requirements.txt
+```
+
+## Agent Decision Rubric
+
+**Run locally on Mac when:**
+- Model ≤3B parameters (text-only)
+- Short context (≤1024 tokens)
+- LoRA/PEFT fine-tuning
+- Quick smoke test or dataset validation
+
+**Recommend HF Jobs / cloud GPU when:**
+- Model 7B+ parameters
+- Vision-language models (VLMs)
+- QLoRA 4-bit training (CUDA/bitsandbytes-centric)
+- Long context or full fine-tuning
+- Production training runs
+
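+As an illustration only, the rubric can be sketched as a tiny helper; the thresholds below are the rough guidelines from this section, not hard limits:
+
+```python
+def recommend_backend(params_b: float, max_seq_len: int, is_vlm: bool,
+                      method: str = "lora") -> str:
+    """Rough local-vs-cloud heuristic mirroring the rubric above."""
+    if is_vlm or params_b >= 7 or method in ("qlora", "full"):
+        return "hf-jobs"      # cloud GPU recommended
+    if params_b <= 3 and max_seq_len <= 1024 and method == "lora":
+        return "local-mac"    # fine for smoke tests / quick iteration
+    return "hf-jobs"          # when in doubt, go to the cloud
+
+# Example: 0.5B text-only model, 512-token context, LoRA -> "local-mac"
+print(recommend_backend(0.5, 512, is_vlm=False))
+```
+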
+## Scope
+
+✅ Good for (local Mac):
+- Quick experiments / smoke tests
+- **Text-only** models ~0.5B–3B
+- **LoRA SFT** with small batches + short context
+
+⚠️ Usually not good on a Mac laptop:
+- Large models (7B+) for a pleasant dev loop
+- **QLoRA 4-bit training** (often CUDA/bitsandbytes-centric)
+- Vision-language fine-tuning (VLMs) at real scale
+
+If you need VLM training (e.g., LLaVA/Qwen-VL) or larger models, prefer **HF Jobs / cloud GPU** and keep local for validation.
+
+## How this fits the model-trainer skill
+
+This skill primarily targets **HF Jobs** (cloud training). Local Mac training is most useful to:
+- Validate dataset formatting and prompt templates
+- Confirm your LoRA setup works end-to-end
+- Run quick regression tests before submitting a real HF Jobs run
+
+Typical workflow:
+1. Run a small local LoRA smoke test (this doc)
+2. Move the "real run" to HF Jobs with the same model/dataset/hyperparams
+3. After training, export/quantize as needed (see [gguf_conversion.md](gguf_conversion.md) for on-device inference)
+
+## Before you start
+
+### Recommended "local-friendly" defaults
+- Model: 0.5B–1.5B for first run
+- Max seq length: 512–1024
+- Batch size: 1
+- Gradient accumulation: 8–16
+- LoRA: r=8–16, alpha=16–32
+- Save only adapters (small artifacts)
+
+### Rough memory guidance (Apple Silicon unified memory)
+Very approximate; depends on context length and model:
+- 16GB: start with ~0.5B–1.5B
+- 32GB: ~1.5B–3B
+- 64GB: larger experiments possible, but long-context can still blow up
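+
+To check how much unified memory your machine actually has before picking a model size:
+
+```bash
+# Total physical (unified) memory in GB
+echo "$(( $(sysctl -n hw.memsize) / 1073741824 )) GB"
+```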
+
+## Setup
+
+### 0) Xcode CLT
+```bash
+xcode-select --install
+```
+
+### 1) Create a venv
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+python -m pip install -U pip
+```
+
+### 2) Install training deps
+```bash
+python -m pip install -U "torch>=2.2" "transformers>=4.40" "trl>=0.12" "peft>=0.10" \
+ datasets accelerate safetensors huggingface_hub
+```
+
+**Note:** Use a recent stable PyTorch version; MPS support improves frequently. Check your version:
+```bash
+python -c "import torch; print(torch.__version__, '| MPS available:', torch.backends.mps.is_available())"
+```
+
+### 3) (Optional) Configure Accelerate
+```bash
+accelerate config
+```
+
+Suggested answers for local Mac:
+- compute environment: local machine
+- distributed: no
+- mixed precision: no (recommended for MPS stability)
+- device: MPS (if offered)
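+
+If you do configure Accelerate, you can launch the training script through it; plain `python train_lora_sft.py` also works without this step:
+
+```bash
+accelerate launch train_lora_sft.py
+```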
+
+### 4) (Optional) Login to Hugging Face
+Only needed if you'll push artifacts to the Hub:
+```bash
+huggingface-cli login
+```
+
+## Run: Local LoRA SFT smoke test
+
+### Why this recipe?
+- Works without CUDA-specific toolchains
+- Uses conservative settings for Mac (small batch, gradient accumulation, checkpointing)
+- Uses TRL API (`SFTConfig` + `processing_class`)
+
+### Fastest first run
+
+For the quickest smoke test, limit steps (not just dataset size):
+```bash
+# Ultra-fast: only 50 training steps
+MAX_STEPS=50 python train_lora_sft.py
+```
+
+Or use a tiny dataset slice:
+```bash
+DATASET_SPLIT="train_sft[:100]" MAX_SEQ_LENGTH=512 python train_lora_sft.py
+```
+
+### Using a local JSONL file
+
+Create `test_data.jsonl`:
+```jsonl
+{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]}
+{"messages": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
+```
+
+Run with:
+```bash
+DATA_FILES="test_data.jsonl" python train_lora_sft.py
+```
+
+For text-only format (no chat template), create `test_text.jsonl`:
+```jsonl
+{"text": "User: Hello\nAssistant: Hi there!"}
+{"text": "User: What is 2+2?\nAssistant: 4"}
+```
+
+Run with:
+```bash
+DATA_FILES="test_text.jsonl" TEXT_FIELD="text" MESSAGES_FIELD="" python train_lora_sft.py
+```
+
+### Full training script: train_lora_sft.py
+
+```python
+import os
+from dataclasses import dataclass
+from typing import Optional
+import torch
+from datasets import load_dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
+from peft import LoraConfig
+from trl import SFTTrainer, SFTConfig
+
+# Reproducibility
+set_seed(42)
+
+@dataclass
+class Cfg:
+ model_id: str = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
+ dataset_id: str = os.environ.get("DATASET_ID", "HuggingFaceH4/ultrachat_200k")
+ dataset_split: str = os.environ.get("DATASET_SPLIT", "train_sft[:500]")
+ data_files: Optional[str] = os.environ.get("DATA_FILES", None) # For local JSONL
+ text_field: str = os.environ.get("TEXT_FIELD", "")
+ messages_field: str = os.environ.get("MESSAGES_FIELD", "messages")
+ out_dir: str = os.environ.get("OUT_DIR", "outputs/local-lora")
+ max_seq_length: int = int(os.environ.get("MAX_SEQ_LENGTH", "512"))
+ max_steps: int = int(os.environ.get("MAX_STEPS", "-1")) # <=0 = use epochs
+
+cfg = Cfg()
+
+def get_device():
+ """Get the best available device for Mac training."""
+ if torch.backends.mps.is_available():
+ return "mps"
+ return "cpu"
+
+def get_dtype():
+ """
+    Return a conservative dtype for MPS training.
+    - fp32 is the safest default on MPS
+    - bf16 may work on newer Apple Silicon with a recent PyTorch/macOS; verify on your machine before enabling it
+ - fp16 often causes NaN issues on MPS
+ """
+ return torch.float32
+
+device = get_device()
+print(f"Model: {cfg.model_id}")
+if cfg.data_files:
+ print(f"Dataset: local file ({cfg.data_files})")
+else:
+ print(f"Dataset: {cfg.dataset_id} ({cfg.dataset_split})")
+print(f"Device: {device}")
+print(f"Dtype: {get_dtype()}")
+if cfg.max_steps > 0:
+ print(f"Max steps: {cfg.max_steps}")
+
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(cfg.model_id, use_fast=True)
+if tokenizer.pad_token is None:
+ tokenizer.pad_token = tokenizer.eos_token
+tokenizer.padding_side = "right" # Important for causal LM training
+
+# Load model
+# Note: Some models require trust_remote_code=True
+model = AutoModelForCausalLM.from_pretrained(
+ cfg.model_id,
+ torch_dtype=get_dtype(),
+ # trust_remote_code=True, # Uncomment if model requires custom code
+)
+model.to(device)
+
+# Important: disable cache when using gradient checkpointing
+model.config.use_cache = False
+
+# Load dataset
+if cfg.data_files:
+ # Local JSONL file
+ ds = load_dataset("json", data_files=cfg.data_files, split="train")
+else:
+ # Hub dataset
+ ds = load_dataset(cfg.dataset_id, split=cfg.dataset_split)
+
+def format_example(ex):
+ # Case A: dataset already has plain text
+ if cfg.text_field and isinstance(ex.get(cfg.text_field), str):
+ ex["text"] = ex[cfg.text_field]
+ return ex
+
+ # Case B: chat-like datasets with messages list
+ msgs = ex.get(cfg.messages_field)
+ if isinstance(msgs, list):
+ if hasattr(tokenizer, "apply_chat_template"):
+ try:
+ ex["text"] = tokenizer.apply_chat_template(
+ msgs, tokenize=False, add_generation_prompt=False
+ )
+ return ex
+ except Exception:
+ pass
+ # Fallback: naive join (smoke test only)
+ ex["text"] = "\n".join([str(m) for m in msgs])
+ return ex
+
+ # Last resort
+ ex["text"] = str(ex)
+ return ex
+
+ds = ds.map(format_example)
+
+# Drop unused columns to reduce memory
+cols_to_keep = ["text"]
+ds = ds.remove_columns([c for c in ds.column_names if c not in cols_to_keep])
+
+# LoRA config
+# Note: target_modules vary by architecture.
+# Common patterns:
+# - Llama/Qwen/Mistral: q_proj, k_proj, v_proj, o_proj
+# - Some models add: gate_proj, up_proj, down_proj
+# If you get "module not found" errors, see Troubleshooting section below.
+lora = LoraConfig(
+ r=16,
+ lora_alpha=32,
+ lora_dropout=0.05,
+ bias="none",
+ task_type="CAUSAL_LM",
+ target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+)
+
+# SFT config
+# Note: TRL API evolves; some param names may differ across versions.
+# If you hit TypeErrors, see troubleshooting.md for API mismatches.
+# Build the config dict conditionally so that only one of max_steps / num_train_epochs is set
+sft_kwargs = dict(
+ output_dir=cfg.out_dir,
+ per_device_train_batch_size=1,
+ gradient_accumulation_steps=8,
+ learning_rate=2e-4,
+ logging_steps=10,
+ save_steps=200,
+ save_total_limit=2,
+ gradient_checkpointing=True,
+ report_to="none",
+ # MPS stability: fp32 is safest. Only enable fp16/bf16 if tested on your machine.
+ fp16=False,
+ bf16=False,
+ max_seq_length=cfg.max_seq_length,
+ dataset_text_field="text",
+)
+
+if cfg.max_steps > 0:
+ sft_kwargs["max_steps"] = cfg.max_steps
+else:
+ sft_kwargs["num_train_epochs"] = 1
+
+sft_args = SFTConfig(**sft_kwargs)
+
+trainer = SFTTrainer(
+ model=model,
+ train_dataset=ds,
+ peft_config=lora,
+ args=sft_args,
+ processing_class=tokenizer,
+)
+
+trainer.train()
+trainer.save_model(cfg.out_dir)
+print(f"✅ Saved to: {cfg.out_dir}")
+```
+
+### Run it
+
+```bash
+python train_lora_sft.py
+```
+
+Optional overrides:
+```bash
+# Different model
+MODEL_ID="Qwen/Qwen2.5-1.5B-Instruct" python train_lora_sft.py
+
+# Quick 50-step test
+MAX_STEPS=50 python train_lora_sft.py
+
+# Local data file
+DATA_FILES="my_data.jsonl" python train_lora_sft.py
+```
+
+If MPS is flaky, run with fallback:
+```bash
+PYTORCH_ENABLE_MPS_FALLBACK=1 python train_lora_sft.py
+```
+
+To help with memory pressure on Mac:
+```bash
+PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 python train_lora_sft.py
+```
+
+> **Caution:** Setting `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` disables memory limits and may cause system instability or unresponsiveness if memory is exhausted. Use with care and monitor Activity Monitor.
+
+## What success looks like
+
+✅ Good signs:
+- Loss decreases over steps
+- Output directory contains adapter weights + config
+- A quick generation test runs without errors
+
+**Expected training output:**
+```
+Model: Qwen/Qwen2.5-0.5B-Instruct
+Dataset: HuggingFaceH4/ultrachat_200k (train_sft[:500])
+Device: mps
+Dtype: torch.float32
+{'loss': 2.1453, 'grad_norm': 1.234, 'learning_rate': 0.0002, 'epoch': 0.16}
+{'loss': 1.8721, 'grad_norm': 0.987, 'learning_rate': 0.0002, 'epoch': 0.32}
+...
+✅ Saved to: outputs/local-lora
+```
+
+If loss is NaN / exploding:
+- Ensure `fp16=False` (default in the script above)
+- Reduce learning rate (e.g., 2e-4 → 1e-4 or 5e-5)
+- Shorten max sequence length
+
+## After: quick local evaluation
+
+### Evaluation script: eval_generate.py
+
+```python
+import os
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from peft import PeftModel
+
+BASE = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
+ADAPTER = os.environ.get("ADAPTER_DIR", "outputs/local-lora")
+
+device = "mps" if torch.backends.mps.is_available() else "cpu"
+tokenizer = AutoTokenizer.from_pretrained(BASE, use_fast=True)
+model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float32)
+model.to(device)
+model = PeftModel.from_pretrained(model, ADAPTER)
+
+prompt = os.environ.get("PROMPT", "Explain gradient accumulation in 3 bullet points.")
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+with torch.no_grad():
+ out = model.generate(
+ **inputs,
+ max_new_tokens=120,
+ do_sample=True,
+ temperature=0.7,
+ top_p=0.9,
+ )
+
+print(tokenizer.decode(out[0], skip_special_tokens=True))
+```
+
+Run:
+```bash
+python eval_generate.py
+```
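+
+The script reads `MODEL_ID`, `ADAPTER_DIR`, and `PROMPT` from the environment, so you can point it at another prompt or adapter directory:
+
+```bash
+PROMPT="Summarize LoRA in one sentence." python eval_generate.py
+```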
+
+## After: push artifacts to the Hub (optional)
+
+Prefer uploading adapters for local runs (smaller artifacts).
+See also: [hub_saving.md](hub_saving.md)
+
+```bash
+huggingface-cli login
+```
+
+```python
+from huggingface_hub import HfApi
+api = HfApi()
+api.upload_folder(
+ folder_path="outputs/local-lora",
+ repo_id="YOUR_USERNAME/YOUR_ADAPTER_REPO",
+ repo_type="model",
+)
+```
+
+## After: transition from local → HF Jobs
+
+Use local runs to validate dataset formatting, chat template, LoRA target modules, and general stability.
+
+Then run the real job on HF Jobs:
+- Keep `MODEL_ID`, dataset id, prompt formatting, and hyperparams consistent
+- Bump dataset size + context length gradually
+- Track results with the skill's normal workflow
+
+## After: export for on-device inference
+
+If your end goal is on-device inference (e.g., GGUF/llama.cpp):
+1. Train (local or cloud)
+2. Merge adapters (if needed)
+3. Convert + quantize
+
+See [gguf_conversion.md](gguf_conversion.md) for details.
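+
+A minimal sketch of step 2 (merging the LoRA adapter into the base model with PEFT), assuming the adapter from the smoke test above lives in `outputs/local-lora`:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+
+base_id = "Qwen/Qwen2.5-0.5B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(base_id)
+model = PeftModel.from_pretrained(model, "outputs/local-lora")
+
+# Fold the LoRA weights into the base model and save a standalone checkpoint
+merged = model.merge_and_unload()
+merged.save_pretrained("outputs/local-merged")
+AutoTokenizer.from_pretrained(base_id).save_pretrained("outputs/local-merged")
+```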
+
+## Troubleshooting (macOS)
+
+For broader issues, see [troubleshooting.md](troubleshooting.md).
+This section covers **Mac-specific** problems.
+
+### MPS errors (unsupported op / crash)
+Some PyTorch ops aren't fully supported on MPS. Use CPU fallback:
+
+```bash
+PYTORCH_ENABLE_MPS_FALLBACK=1 python train_lora_sft.py
+```
+
+This can be slower but usually prevents crashes.
+
+### Monitoring GPU memory usage
+To monitor MPS memory during training:
+
+**Activity Monitor (GUI):**
+1. Open Activity Monitor → Window → GPU History
+2. Watch "Memory Used" during training
+
+**Command line:**
+```bash
+# Monitor GPU power and memory (requires sudo)
+sudo powermetrics --samplers gpu_power -i 1000
+
+# Or use this Python snippet during training:
+python -c "import torch; print(f'MPS allocated: {torch.mps.driver_allocated_memory() / 1e9:.2f} GB')"
+```
+
+**In your training script**, you can add periodic memory logging:
+```python
+if torch.backends.mps.is_available():
+ print(f"MPS memory: {torch.mps.driver_allocated_memory() / 1e9:.2f} GB")
+```
+
+### Out of memory (OOM)
+If training crashes or your Mac becomes unstable:
+- Reduce `MAX_SEQ_LENGTH` (1024 → 768 → 512)
+- Use a smaller model (e.g., 0.5B instead of 1.5B)
+- Set memory high watermark (use with caution):
+ ```bash
+ PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 python train_lora_sft.py
+ ```
+ **Warning:** This may cause system instability. Monitor Activity Monitor → Memory Pressure.
+- Use fewer examples (e.g., `train_sft[:500]` → `train_sft[:100]`)
+- Keep batch size at 1 and scale with gradient accumulation
+- Close other memory-heavy apps
+
+### fp16 instability (NaNs / loss explodes)
+If loss becomes `nan` or suddenly explodes:
+- Set `fp16=False` in the script (recommended default for MPS)
+- Lower learning rate (2e-4 → 1e-4 → 5e-5)
+- Reduce `MAX_SEQ_LENGTH`
+
+### Intel Macs (not supported)
+Intel Macs don't have MPS acceleration, so training would be CPU-only and impractically slow. This guide has not been tested on Intel Macs. If you're on Intel, use HF Jobs or cloud GPU for all training.
+
+### LoRA target module mismatch
+If you see "module not found" errors, the model architecture uses different module names.
+
+**Quick debug:**
+```python
+from transformers import AutoModelForCausalLM
+model = AutoModelForCausalLM.from_pretrained("your-model-id")
+# Print likely LoRA targets (attention projections)
+for name, _ in model.named_modules():
+ if any(x in name.lower() for x in ["q_proj", "k_proj", "v_proj", "o_proj", "query", "key", "value", "dense"]):
+ print(name)
+```
+
+**Common patterns:**
+| Architecture | target_modules |
+|--------------|----------------|
+| Llama/Qwen/Mistral | `q_proj`, `k_proj`, `v_proj`, `o_proj` |
+| GPT-2/GPT-J | `c_attn`, `c_proj` |
+| BLOOM | `query_key_value`, `dense` |
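+
+If you'd rather not enumerate module names, recent PEFT releases also accept `target_modules="all-linear"`, which targets all linear layers except the output head (verify your installed PEFT version supports it):
+
+```python
+from peft import LoraConfig
+
+lora = LoraConfig(
+    r=16,
+    lora_alpha=32,
+    task_type="CAUSAL_LM",
+    target_modules="all-linear",  # needs a reasonably recent peft release
+)
+```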
+
+### TRL version differences
+The script uses `SFTConfig` + `processing_class` (TRL ≥0.12). If you hit `TypeError` on arguments like `max_seq_length` vs `max_length`, or `dataset_text_field` issues, check your TRL version and see [troubleshooting.md](troubleshooting.md) for API mismatches.
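+
+A quick way to see which sequence-length argument your installed TRL accepts (assumes TRL imports cleanly in the active venv):
+
+```python
+import inspect
+import trl
+from trl import SFTConfig
+
+print("TRL version:", trl.__version__)
+params = inspect.signature(SFTConfig.__init__).parameters
+print([name for name in params if "length" in name or "seq" in name])
+```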
+
+## Alternative: MLX for Apple Silicon
+
+[MLX](https://github.com/ml-explore/mlx) is Apple's machine learning framework optimized for Apple Silicon. While this guide focuses on PyTorch + MPS (for compatibility with the HF ecosystem), MLX can offer better performance on Mac for some workflows.
+
+**When to consider MLX:**
+- You're doing inference-heavy workflows on Apple Silicon
+- You want tighter Metal/GPU integration
+- You're building Mac-native ML applications
+
+**MLX limitations for this skill:**
+- Smaller ecosystem than PyTorch/HF
+- Not all HF models have MLX ports
+- Training APIs are less mature than TRL/PEFT
+- Harder to transfer workflows to cloud GPU
+
+**Resources:**
+- [mlx-lm](https://github.com/ml-explore/mlx-lm/tree/main/mlx_lm) – LLM inference and fine-tuning with MLX
+- [MLX documentation](https://ml-explore.github.io/mlx/)
+
+For this skill's workflow (local validation → HF Jobs), PyTorch + MPS remains the recommended path for consistency.
+
+## See Also
+
+- [troubleshooting.md](troubleshooting.md) – General TRL troubleshooting
+- [hardware_guide.md](hardware_guide.md) – GPU selection for HF Jobs
+- [gguf_conversion.md](gguf_conversion.md) – Export for on-device inference
+- [training_methods.md](training_methods.md) – SFT, DPO, GRPO overview
\ No newline at end of file