---
title: Local LLM on Android
description: Run local LLM inference via node-llama-cpp and Ollama on your Android device.
---

## Overview

OpenClaw supports local LLM inference via node-llama-cpp and Ollama integration. The prebuilt native binary (`@node-llama-cpp/linux-arm64`) ships with the installation and loads successfully under the glibc environment, so local LLM inference is technically functional on the phone.

However, there are practical constraints to consider before running local models.

**☁️ Cloud Models Available**: Ollama now supports cloud-hosted models! Use `ollama launch openclaw --model kimi-k2.5:cloud` for superior performance without local resource usage. See [Cloud Models](#ollama-cloud-models) section below.

## ⚠️ Practical Constraints

| Constraint | Details |
| --- | --- |
| RAM | GGUF models need roughly 2-4GB of free memory (7B model, Q4 quantization). Phone RAM is shared with Android and other apps. |
| Storage | Model files range from 4GB to 70GB+. Phone storage fills up fast. |
| Speed | CPU-only inference on ARM is very slow; Android does not support GPU offloading for llama.cpp. |
| Use Case | OpenClaw primarily routes to cloud LLM APIs (OpenAI, Gemini, etc.), which respond at the same speed as on a PC. Local inference is a supplementary feature. |
For **experimentation**, small models like **TinyLlama 1.1B (Q4, ~670MB)** can run on the phone. For **production use**, cloud LLM providers are recommended.

## ☁️ Ollama Cloud Models

Best of both worlds: run models in the cloud with Ollama's cloud integration. No local RAM/storage constraints!

### Quick Start

```bash
# Pull and launch with cloud model
ollama pull kimi-k2.5:cloud
ollama launch openclaw --model kimi-k2.5:cloud
```

### Recommended Cloud Models

| Model | Use Case | Context |
| --- | --- | --- |
| `kimi-k2.5:cloud` | Multimodal reasoning with subagents | 64k+ tokens |
| `minimax-m2.5:cloud` | Fast, efficient coding | 64k+ tokens |
| `glm-5:cloud` | Reasoning and code generation | 64k+ tokens |
| `gpt-oss:120b-cloud` | High-performance tasks | 128k tokens |
| `gpt-oss:20b` | Balanced performance | 64k tokens |

### Commands

| Command | Description |
| --- | --- |
| `ollama launch openclaw` | Launch with model selector |
| `ollama launch openclaw --model <model>` | Launch with a specific cloud model |
| `ollama launch openclaw --config` | Configure without launching |
| `ollama pull <model>:cloud` | Pull a cloud model into the local registry |

### Why Cloud Models?

| Advantage | Details |
| --- | --- |
| No Local Resources | Zero RAM/storage usage on the phone |
| Superior Performance | Full GPU acceleration on cloud servers |
| Large Context | 64k-128k token windows available |
| Always Updated | Latest model versions automatically |
| Privacy Option | Local models remain available for sensitive data |

💡 **Recommendation**: Use cloud models for production workloads, local models for testing/experimentation.


## 🚀 Quick Start

### Option 1: node-llama-cpp (Recommended for Android)

**Why `--ignore-scripts`?** The installer uses `npm install -g openclaw@latest --ignore-scripts` because node-llama-cpp's postinstall script attempts to compile llama.cpp from source via cmake, a process that takes 30+ minutes on a phone and fails due to toolchain incompatibilities. The prebuilt binaries work without this compilation step, so the postinstall is safely skipped.

Install:

```bash
npm install -g node-llama-cpp --ignore-scripts
```

Download a model (TinyLlama 1.1B Q4 - good for testing):

```bash
mkdir -p ~/models
cd ~/models
curl -L -o tinyllama-1.1b-q4.gguf "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
```

Run inference (node-llama-cpp v3 exposes an async, ESM-only API; save as `chat.mjs` and run with `node chat.mjs`):

```js
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: "/data/data/com.termux/files/home/models/tinyllama-1.1b-q4.gguf"
});
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

console.log(await session.prompt("Hello, how are you?"));
```

### Option 2: Ollama (Full Server)

Ollama provides a complete local LLM server with model management.

Install Ollama:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Start the server:

```bash
ollama serve &
```

Pull a model:

```bash
# Small model for testing
ollama pull tinyllama

# Or larger models if you have RAM
ollama pull llama3.2:1b
ollama pull phi3:mini
```

Chat with a model:

```bash
ollama run tinyllama "Hello, how are you?"
```

API Endpoint:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Hello, how are you?"
}'
```

Ollama needs more RAM and storage than node-llama-cpp. Recommended only for devices with **6GB+ RAM** and **32GB+ free storage**.
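By default, `/api/generate` streams its reply as newline-delimited JSON, one chunk per token batch, ending with a chunk whose `done` field is true. A minimal sketch of reassembling the streamed reply (the sample chunks below are illustrative, not real model output):

```python
import json

def assemble_stream(ndjson_lines):
    """Join the 'response' fields of Ollama's streaming chunks into the full reply."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk carries done=true plus timing stats
            break
    return "".join(parts)

# Illustrative chunks in the shape /api/generate streams:
chunks = [
    '{"model": "tinyllama", "response": "Hello", "done": false}',
    '{"model": "tinyllama", "response": ", I am fine!", "done": true}',
]
print(assemble_stream(chunks))  # Hello, I am fine!
```

Pass `"stream": false` in the request body instead if you prefer a single JSON object.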

## 🔗 Official Ollama OpenClaw Integration

OpenClaw officially integrates with Ollama to provide a seamless local AI assistant experience.

### Why it's powerful

  1. Native API Integration: OpenClaw connects directly to Ollama's native /api/chat endpoint. This ensures full support for streaming and tool calling.

    ⚠️ Important: Do not use the /v1 OpenAI-compatible URL with OpenClaw. It breaks tool calling and causes models to output raw JSON!

  2. Automatic Model Discovery: OpenClaw queries /api/tags and /api/show to automatically find your downloaded Ollama models, detect if they support tool calling, and configure their context windows appropriately.
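The discovery calls above can be sketched with plain HTTP. In this sketch, `list_models` assumes an Ollama server on the default port, while `model_names` is a pure helper over the parsed `/api/tags` payload (both function names are illustrative, not OpenClaw internals):

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default Ollama base URL

def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]

def list_models(base_url=OLLAMA_URL):
    """Query a running Ollama server for its downloaded models."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return model_names(json.load(resp))

# Shape of a /api/tags body (illustrative):
sample = {"models": [{"name": "tinyllama:latest"}, {"name": "llama3.2:1b"}]}
print(model_names(sample))  # ['tinyllama:latest', 'llama3.2:1b']
```

A follow-up POST to `/api/show` with `{"model": "<name>"}` returns the per-model details (capabilities, context length) that OpenClaw uses for configuration.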

### Setup Methods

**Method A: Ollama Launcher (Recommended)**

The easiest way to connect OpenClaw to Ollama is the official launcher command:

```bash
ollama launch openclaw
```

This sets up the security profile, configures the provider, and sets your primary model. To launch a specific model directly:

```bash
# Example with cloud model
ollama launch openclaw --model kimi-k2.5:cloud
```

**Method B: OpenClaw Onboarding**

Run the onboarding wizard and select "Ollama" when asked for a provider:

```bash
openclaw onboard
```

It will ask for your Ollama base URL (default is `http://127.0.0.1:11434`).

**Method C: Explicit Configuration**

You can force OpenClaw to use Ollama by exporting the API key environment variable before starting the gateway:

```bash
export OLLAMA_API_KEY="ollama-local"
openclaw gateway
```

## 📊 Model Recommendations

| Model | Size (Q4) | RAM Needed | Speed | Use Case |
| --- | --- | --- | --- | --- |
| TinyLlama 1.1B | ~670MB | 2GB | Fast | Testing, experimentation |
| Phi-3 Mini (3.8B) | ~2.3GB | 4GB | Medium | Light tasks |
| Llama 3.2 1B | ~670MB | 2GB | Fast | Mobile-friendly |
| Llama 3.2 3B | ~2GB | 4GB | Medium | Balanced |
| Mistral 7B | ~4.1GB | 8GB | Slow | Advanced users only |
| Llama 3 8B | ~4.7GB | 8GB+ | Very Slow | Not recommended |
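The Q4 sizes above follow roughly from parameter count: Q4_K_M averages a little under 5 bits per weight. A rule-of-thumb estimator (the 4.85 bits/weight figure is an approximation, not a spec; real files also carry metadata and mixed-precision layers, so expect some deviation):

```python
def q4_size_gb(n_params_billion, bits_per_weight=4.85):
    """Rough GGUF file size for a Q4_K_M quantized model, in GB."""
    # billions of params * bits per weight / 8 bits per byte = GB
    return n_params_billion * bits_per_weight / 8

for name, params in [("TinyLlama 1.1B", 1.1), ("Llama 3.2 3B", 3.2), ("Mistral 7B", 7.2)]:
    print(f"{name}: ~{q4_size_gb(params):.2f} GB")
```

Add 1-2 GB on top of the file size for the KV cache and runtime overhead when budgeting RAM.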

## 🔧 Configuration

### node-llama-cpp Context Length

Reduce the context length to save RAM. In node-llama-cpp v3, the context size is set when creating the context, not on the chat session:

```js
const context = await model.createContext({
  contextSize: 2048 // default is 4096
});
const session = new LlamaChatSession({contextSequence: context.getSequence()});
```
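Context size matters because the KV cache grows linearly with it. A hedged estimate of that cache, using TinyLlama 1.1B's published architecture (22 layers, 4 KV heads under grouped-query attention, head dimension 64) and assuming an fp16 cache:

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes held by the KV cache: keys + values, across all layers."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# TinyLlama 1.1B: 22 layers, 4 KV heads, head dim 64, fp16 cache assumed
for ctx in (2048, 4096):
    mib = kv_cache_bytes(ctx, 22, 4, 64) / 2**20
    print(f"contextSize {ctx}: ~{mib:.0f} MiB KV cache")
```

Halving the context size halves this cache, which is significant headroom on a RAM-constrained phone.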

### Ollama Configuration

Set environment variables before starting:

```bash
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve
```

## 🌐 Cloud vs Local Comparison

| Feature | Local LLM | Cloud LLM (OpenClaw) | Ollama Cloud Models |
| --- | --- | --- | --- |
| Speed | Slow (CPU-only) | Fast (GPU-accelerated) | ⚡ Fastest (cloud GPU) |
| Privacy | ✅ Full privacy | Depends on provider | Depends on provider |
| Cost | Free (after hardware) | Pay-per-token | Free via Ollama |
| Model Size | Limited by RAM (2-8GB) | Unlimited | Unlimited |
| Context Window | 2k-8k tokens | 64k-200k tokens | 64k-128k tokens |
| Setup | Manual download | One command | `ollama pull` |
| Internet | Not needed | Required | Required |
| RAM Usage | 2-8GB | None | None |
| Storage | 4-70GB | None | Minimal |
| Best For | Testing, offline | Production | Production + testing |

πŸ› οΈ Troubleshooting

"Cannot find module 'node-llama-cpp'"

Make sure you installed with --ignore-scripts:

npm install -g node-llama-cpp --ignore-scripts

"Out of memory" error

Close other apps and reduce context size:

export NODE_OPTIONS="--max-old-space-size=1024"

### Ollama killed by Android

Android's Phantom Process Killer terminates long-running background processes in Termux. Disable it via adb from a PC:

```bash
adb shell device_config set_sync_disabled_for_tests persistent
adb shell device_config put activity_manager max_phantom_processes 2147483647
```

### Model download fails

Use a different mirror, or download on a PC and transfer:

```bash
# On the PC
curl -L -o model.gguf "URL"
# Transfer via USB or scp
scp model.gguf phone:~/models/
```



## 💡 Best Practices

1. Start small: Begin with TinyLlama 1.1B to test your device
2. Monitor RAM: Use `htop` or Termux's `top` to watch memory usage
3. Use tmux: Run long inference sessions in `tmux` to prevent disconnection
4. Cool your phone: CPU inference generates heat; consider active cooling
5. Cloud for production: Use local LLM for testing, cloud for real work

💡 **Pro Tip**: Use OpenClaw's hybrid mode: route simple queries to the local LLM and complex tasks to cloud APIs. Best of both worlds!