55 changes: 55 additions & 0 deletions README.md
@@ -79,6 +79,61 @@ npm run dev

Then open http://localhost:5173 in your browser.

## Run with Local Models (Ollama)

You can run the council against local, open‑source models you installed with Ollama (e.g. `nemotron`, `nemotron9b`, `nemotron12b`).

1) Ensure the Ollama daemon is running and models are installed:
```bash
ollama list
```

2) Option A — Switch the backend to local mode (with resource guard). Either set env vars:
```bash
export COUNCIL_PROVIDER=local
export LOCAL_MODELS="nemotron,nemotron9b" # customize as you like
# optional:
# export CHAIRMAN_LOCAL_MODEL=nemotron12b
# export COUNCIL_MAX_PARALLEL_LOCAL=2 # hard cap concurrent runs
# export COUNCIL_MEM_RESERVE_GB=6 # keep RAM reserved for OS/apps
```
…or edit `backend/config.py` accordingly.
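
If you edit `backend/config.py` instead, the intent is the same as the env vars above. A minimal sketch of how such settings could be read from the environment (the actual structure of `config.py` may differ; the variable names here just mirror the env vars):

```python
# Hypothetical sketch of local-mode settings in backend/config.py;
# names mirror the environment variables above and may not match the real file.
import os

COUNCIL_PROVIDER = os.environ.get("COUNCIL_PROVIDER", "openrouter")
LOCAL_MODELS = [
    m.strip()
    for m in os.environ.get("LOCAL_MODELS", "nemotron,nemotron9b").split(",")
    if m.strip()
]
CHAIRMAN_LOCAL_MODEL = os.environ.get(
    "CHAIRMAN_LOCAL_MODEL", LOCAL_MODELS[-1] if LOCAL_MODELS else "nemotron"
)
COUNCIL_MAX_PARALLEL_LOCAL = int(os.environ.get("COUNCIL_MAX_PARALLEL_LOCAL", "0"))  # 0 = no hard cap
COUNCIL_MEM_RESERVE_GB = float(os.environ.get("COUNCIL_MEM_RESERVE_GB", "6"))
```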

3) Start the backend as usual:
```bash
uv run python -m backend.main
```

### Safeguards to avoid crashes
- Adaptive memory guard: estimates each model's memory weight (from Ollama's `/api/tags` listing, or safe defaults) and keeps the total under a computed budget (roughly the larger of 60% of RAM or total RAM minus the reserve).
- Optional hard cap: set `COUNCIL_MAX_PARALLEL_LOCAL` to limit how many local models run in parallel.
- Per-request timeout for local calls (default 180s) prevents hung requests.

This lets you run multiple efficient models in parallel while avoiding overload on machines with limited memory.
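
The guard described above boils down to roughly the following (an illustrative sketch, not the backend's actual code; `psutil`, the default weight, and the helper names are assumptions):

```python
# Illustrative memory-guard sketch: estimate model weights from Ollama's
# /api/tags listing and keep the total under the computed budget.
import os
import httpx
import psutil  # assumed available; any way of reading total RAM works

OLLAMA = os.environ.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434")
RESERVE_GB = float(os.environ.get("COUNCIL_MEM_RESERVE_GB", "6"))
DEFAULT_WEIGHT_GB = 6.0  # safe fallback when a model's size is unknown


def model_weights_gb() -> dict[str, float]:
    """Per-model memory weight estimated from the size reported by Ollama."""
    tags = httpx.get(f"{OLLAMA}/api/tags", timeout=10).json()
    return {m["name"]: (m.get("size", 0) / 1e9) or DEFAULT_WEIGHT_GB
            for m in tags.get("models", [])}


def memory_budget_gb() -> float:
    """Budget is roughly max(60% of RAM, RAM minus the configured reserve)."""
    total = psutil.virtual_memory().total / 1e9
    return max(0.6 * total, total - RESERVE_GB)


def fits(selected: list[str]) -> bool:
    """True if the selected models' estimated weights fit under the budget."""
    weights = model_weights_gb()
    needed = sum(weights.get(name, DEFAULT_WEIGHT_GB) for name in selected)
    return needed <= memory_budget_gb()
```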

## Custom Self‑Hosted Models (Optional)

Alternatively, you can point the council at any OpenAI‑compatible endpoint (e.g., vLLM, Ollama’s OpenAI proxy) without enabling the local resource guard:

In `backend/config.py`:

```python
CUSTOM_MODELS = {
"ollama/llama3": {
"api_url": "http://localhost:11434/v1/chat/completions",
"api_key": "ollama" # Optional, defaults to "custom"
}
}

# Don't forget to add it to the council list!
COUNCIL_MODELS = [
# ... other models
"ollama/llama3",
]
```

With this setup, the backend will route those IDs to the specified endpoints via the same OpenRouter client code path (no adaptive guard). See related discussion and example in the PR for custom models.
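
In essence, routing one of these custom IDs is just a standard OpenAI-style chat completion call to the configured URL. A simplified sketch (the real client code also handles streaming, retries, and errors; stripping the `ollama/` prefix for the upstream model name is an assumption):

```python
# Simplified sketch of routing a CUSTOM_MODELS entry to its
# OpenAI-compatible endpoint with async httpx.
import httpx

CUSTOM_MODELS = {
    "ollama/llama3": {
        "api_url": "http://localhost:11434/v1/chat/completions",
        "api_key": "ollama",
    }
}


async def custom_chat(model_id: str, messages: list[dict]) -> str:
    cfg = CUSTOM_MODELS[model_id]
    headers = {"Authorization": f"Bearer {cfg.get('api_key', 'custom')}"}
    payload = {
        # Assumes the upstream server expects the bare model name ("llama3").
        "model": model_id.split("/", 1)[-1],
        "messages": messages,
    }
    async with httpx.AsyncClient(timeout=180) as client:
        r = await client.post(cfg["api_url"], json=payload, headers=headers)
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

# Usage (with Ollama running):
#   import asyncio
#   print(asyncio.run(custom_chat("ollama/llama3",
#                                 [{"role": "user", "content": "hi"}])))
```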

## Tech Stack

- **Backend:** FastAPI (Python 3.10+), async httpx, OpenRouter API
301 changes: 301 additions & 0 deletions backend/Local-AI.md
@@ -0,0 +1,301 @@
# Nemotron Nano 8B (Local)

NVIDIA's Llama-3.1-Nemotron-Nano-8B-v1 running locally via Ollama.

## About

A reasoning model fine-tuned by NVIDIA for:
- **Tool calling / function calling**
- **RAG (Retrieval Augmented Generation)**
- **Math & code reasoning**
- **General chat**

Based on Llama 3.1 8B Instruct. Supports 128K context length.

## Install Ollama (macOS / Linux / Windows)

### macOS (Apple Silicon or Intel)

```bash
# Option A (recommended): Homebrew
brew install ollama

# Option B: Official installer
curl -fsSL https://ollama.com/install.sh | sh

# Verify and quick test
ollama --version
ollama run llama3.2:3b "hi"
```

Notes:
- Homebrew is for macOS/Linux only. For Windows, see the Windows section below.
- Apple Silicon (M‑series) accelerates with Metal automatically.
- If you prefer a foreground server: `ollama serve` (Ctrl+C to stop).

### Linux (Ubuntu/Debian/Fedora/Arch)

```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Start/enable service (systemd distros)
sudo systemctl enable --now ollama

# Verify
ollama --version
ollama list
```

GPU (optional):
- NVIDIA: install recent NVIDIA driver + CUDA; verify with `nvidia-smi`.
- If GPU isn’t available, Ollama falls back to CPU.

### Windows 11/10

```powershell
# PowerShell (run as Administrator; requires winget)
winget install Ollama.Ollama
# If winget is unavailable, download the installer from: https://ollama.com

# New terminal:
ollama --version
ollama run llama3.2:3b "hi"
```

Notes:
- Allow Ollama (port 11434) through Windows Firewall on first run.
- Alternative: WSL2 → install Ubuntu, then follow Linux steps inside WSL.

## Add Models to Ollama

You can either pull from the Ollama library or use local GGUF files.

### Option A — Pull from library (1‑line)

```bash
ollama pull llama3.2:3b
ollama run llama3.2:3b "hello"
```

### Option B — Use local GGUF (Nemotron examples below)

See “Download” and “Additional Models (Optional)” sections for GGUF URLs and `Modelfile` examples to register:

```bash
# Example flow
curl -L -o model.gguf "https://huggingface.co/.../model-Q4_K_M.gguf"
cat > Modelfile << 'EOF'
FROM ./model.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF
ollama create mymodel -f Modelfile
ollama run mymodel "hi"
```

## Usage

```bash
ollama run nemotron
```

If you’ve registered additional models:

- nemotron9b: `ollama run nemotron9b`
- nemotron12b: `ollama run nemotron12b`

### Reasoning Mode

Toggle reasoning with the system prompt "detailed thinking on" or "detailed thinking off".

### Recommended Settings

- **Reasoning ON**: temperature 0.6, top_p 0.95
- **Reasoning OFF**: greedy decoding (temperature 0)
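
For example, toggling reasoning and applying the matching settings through Ollama's `/api/chat` endpoint could look like this (a small `httpx` sketch; the prompt is just a placeholder):

```python
# Toggle Nemotron's reasoning mode via the system prompt and apply the
# recommended sampling settings through Ollama's /api/chat endpoint.
import httpx


def ask(prompt: str, reasoning: bool = True) -> str:
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    options = {"temperature": 0.6, "top_p": 0.95} if reasoning else {"temperature": 0.0}
    r = httpx.post(
        "http://127.0.0.1:11434/api/chat",
        json={
            "model": "nemotron",
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            "options": options,
            "stream": False,
        },
        timeout=180,
    )
    r.raise_for_status()
    return r.json()["message"]["content"]


print(ask("Briefly: why is the sky blue?", reasoning=True))
```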

## Download

Download the GGUF weights locally (skip this if the file is already present):
```bash
curl -L -o nemotron-nano.gguf "https://huggingface.co/bartowski/nvidia_Llama-3.1-Nemotron-Nano-8B-v1-GGUF/resolve/main/nvidia_Llama-3.1-Nemotron-Nano-8B-v1-Q4_K_M.gguf"
```

## Files

- Modelfile - Ollama model config
- nemotron-nano.gguf - Model weights (Q4_K_M quantization, ~4.7GB)
- nemotron-9b-v2.gguf - Optional upgrade (Q4_K_M, ~6.1GB)
- nemotron-12b-v2.gguf - Optional upgrade (Q4_K_M, ~7.1GB)

## Use with LLM Council (Local)

Run Karpathy’s LLM Council fully offline using your local Ollama models.

1) Verify Ollama and models

```bash
ollama list
```

2) Install Council deps

```bash
cd Council/llm-council
uv sync
cd frontend && npm install && cd ..
```

3) Configure local mode (.env in Council/llm-council)

```bash
cat > .env <<'EOF'
COUNCIL_PROVIDER=local
LOCAL_MODELS=nemotron,nemotron9b,nemotron12b
CHAIRMAN_LOCAL_MODEL=nemotron12b
COUNCIL_MAX_PARALLEL_LOCAL=2
COUNCIL_MEM_RESERVE_GB=6
# OLLAMA_BASE_URL=http://127.0.0.1:11434
EOF
```

4) Start servers

- Option A:

```bash
./start.sh
```

- Option B:

```bash
uv run python -m backend.main
# in a new terminal:
cd frontend && npm run dev
```

Open http://localhost:5173.

5) Smoke test via CLI (optional)

```bash
bash scripts/council.sh -m "Say hi in five words"
bash scripts/council.sh --stream -m "One fun fact about space."
```

Notes:
- Ensure `LOCAL_MODELS` names match `ollama list` (omit tags like `:latest`).
- If the machine is tight on RAM, list fewer models in `LOCAL_MODELS` or set `COUNCIL_MAX_PARALLEL_LOCAL` to 1 or 2.
- `COUNCIL_MEM_RESERVE_GB` keeps headroom for the OS and other apps; increase it if needed.

### RAM Sizing & Safeguards (Council Local Mode)
- Effective model budget ≈ `max(60% of RAM, RAM − COUNCIL_MEM_RESERVE_GB)`. For example, on a 48 GiB machine with `COUNCIL_MEM_RESERVE_GB=8`, the budget is `max(28.8, 40)` ≈ 40 GiB.
- Recommended on 48 GiB Macs:
  - Light multitasking: `COUNCIL_MEM_RESERVE_GB=8`
  - Heavy multitasking: `COUNCIL_MEM_RESERVE_GB` of 10–12
- To reduce pressure, lower `COUNCIL_MAX_PARALLEL_LOCAL` (a concurrency sketch follows below) or remove a model from `LOCAL_MODELS`.
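
The concurrency sketch mentioned above: fanning one prompt out to several local models while capping how many generate at once, which is roughly what `COUNCIL_MAX_PARALLEL_LOCAL` and the per-request timeout protect against (an illustration, not the council's actual backend code):

```python
# Fan a prompt out to several local Ollama models, capping concurrency with a
# semaphore (the role COUNCIL_MAX_PARALLEL_LOCAL plays in the backend).
import asyncio
import httpx

MODELS = ["nemotron", "nemotron9b", "nemotron12b"]
MAX_PARALLEL = 2   # mirrors COUNCIL_MAX_PARALLEL_LOCAL
TIMEOUT_S = 180    # mirrors the per-request timeout for local calls


async def ask(client: httpx.AsyncClient, sem: asyncio.Semaphore,
              model: str, prompt: str) -> tuple[str, str]:
    async with sem:  # at most MAX_PARALLEL models generate at the same time
        r = await client.post(
            "http://127.0.0.1:11434/api/chat",
            json={"model": model,
                  "messages": [{"role": "user", "content": prompt}],
                  "stream": False},
        )
        r.raise_for_status()
        return model, r.json()["message"]["content"]


async def main() -> None:
    sem = asyncio.Semaphore(MAX_PARALLEL)
    async with httpx.AsyncClient(timeout=TIMEOUT_S) as client:
        results = await asyncio.gather(
            *(ask(client, sem, m, "Say hi in five words") for m in MODELS)
        )
    for model, answer in results:
        print(f"{model}: {answer}")


asyncio.run(main())
```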

## Additional Models (Optional)

### NVIDIA Nemotron Nano 9B v2

Download:

```bash
curl -L -o nemotron-9b-v2.gguf "https://huggingface.co/bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF/resolve/main/nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q4_K_M.gguf"
```

Register with Ollama:

```bash
cat > Modelfile-9b << 'EOF'
FROM ./nemotron-9b-v2.gguf
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF

ollama create nemotron9b -f Modelfile-9b
```

Run:

```bash
ollama run nemotron9b "What are you?"
```

---

### NVIDIA Nemotron Nano 12B v2

Download:

```bash
curl -L -o nemotron-12b-v2.gguf "https://huggingface.co/bartowski/nvidia_NVIDIA-Nemotron-Nano-12B-v2-GGUF/resolve/main/nvidia_NVIDIA-Nemotron-Nano-12B-v2-Q4_K_M.gguf"
```

Register with Ollama:

```bash
cat > Modelfile-12b << 'EOF'
FROM ./nemotron-12b-v2.gguf
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF

ollama create nemotron12b -f Modelfile-12b
```

Run:

```bash
ollama run nemotron12b "What are you?"
```

## Manage Models & Disk Space

```bash
# See installed models and sizes
ollama list

# Remove a model to free space
ollama rm nemotron9b
```

Notes:
- GGUF files are several GB each; keep an eye on free disk space before downloads.
- If using Council in local mode, ensure `LOCAL_MODELS` only includes models you actually need.

## Troubleshooting

- Error: `supplied file was not in GGUF format`
  - This usually means the downloaded file was an HTML page, not a `.gguf`. Make sure you:
    - Use the Hugging Face `resolve/main/... .gguf` URL.
    - Pass `-L` to curl so it follows redirects.
    - Verify the file size is several GB (`ls -lh`); re-download if it is only KB/MB.
- `zsh: command not found: llama`
  - Use `ollama run ...` instead of `llama run`.
- Connection refused to `http://127.0.0.1:11434`
  - Start Ollama:
    - macOS/Windows: launch the app or run `ollama serve`.
    - Linux: `sudo systemctl enable --now ollama`
- Port 11434 already in use
  - Stop whatever is bound to 11434, or change the Ollama host/port (e.g. `OLLAMA_HOST=127.0.0.1:11435 ollama serve`) and update `OLLAMA_BASE_URL` in `.env`.
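
For a quick programmatic health check (a small `httpx` sketch; adjust the base URL if you changed the host or port):

```python
# Health check: confirm the Ollama daemon is reachable and list installed models.
import os
import httpx

base = os.environ.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434")
try:
    tags = httpx.get(f"{base}/api/tags", timeout=5).json()
    names = [m["name"] for m in tags.get("models", [])]
    print(f"Ollama is up at {base}; installed models: {names}")
except httpx.ConnectError:
    print(f"Could not reach {base}. Is the Ollama daemon running?")
```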

## Source

- Official: https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
- GGUF: https://huggingface.co/bartowski/nvidia_Llama-3.1-Nemotron-Nano-8B-v1-GGUF
- 9B v2: https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2
- 12B v2: https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2

## License

NVIDIA Open Model License + Llama 3.1 Community License