A FastAPI-based streaming API for interacting with local LLMs using Ollama. Completely free and runs locally on your machine!
- Streaming Responses: Real-time streaming responses using Server-Sent Events (SSE)
- Simple API: Ask a question, get a streaming response
- FastAPI Framework: Modern, fast, and well-documented API framework
- Ollama Integration: Run powerful open-source LLMs locally (Llama, Mistral, Phi, etc.)
- No API Keys Required: Completely free, private, and runs offline
- Type Safety: Full Pydantic validation for requests and responses
- Easy Configuration: Environment-based configuration
- Python 3.8+
- Ollama installed and running (install from https://ollama.com/download)
- Install Ollama:
Linux:
curl -fsSL https://ollama.com/install.sh | sh
macOS:
brew install ollama
Windows: Download from https://ollama.com/download
- Start Ollama (if not already running):
ollama serve
- Pull a model (choose one or more):
# Llama 3.2 (1B - Fast, recommended for testing)
ollama pull llama3.2
# Llama 3.2 (3B - Better quality)
ollama pull llama3.2:3b
# Mistral (7B - Good balance)
ollama pull mistral
# Phi-3 (3.8B - Microsoft's model)
ollama pull phi3
- Verify Ollama is running:
ollama list
- Clone the repository:
git clone <your-repo-url>
cd llm_streaming_api
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables (optional):
cp .env.example .env
The default configuration works out of the box. Only edit .env if you need to change the Ollama host:
OLLAMA_HOST=http://localhost:11434
Start the FastAPI server:
python main.py
Or using uvicorn directly:
uvicorn main:app --reload --host 0.0.0.0 --port 8000
The API will be available at http://localhost:8000.
Once the server is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
We provide multiple ways to test and visualize the streaming functionality:
Open demo.html in your browser for a beautiful UI with real-time streaming visualization:
xdg-open demo.html # Linux
# or: open demo.html # macOS
# or: start demo.html # Windows
Run the test script to see streaming in your terminal:
python test_streaming.py
Or test with curl:
curl -N -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"question": "Explain Python in one sentence"}'
See QUICKSTART.md for a step-by-step 5-minute setup guide.
Streaming chat completion endpoint. Ask a question and receive a streaming response.
Request Body:
{
"question": "What is the capital of France?",
"model": "llama3.2",
"max_tokens": 1024,
"temperature": 1.0
}
Parameters:
- question (required): The user's question to ask the LLM
- model (optional): Ollama model to use (default: "llama3.2")
- max_tokens (optional): Maximum tokens to generate (default: 1024)
- temperature (optional): Sampling temperature 0.0-1.0 (default: 1.0)
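In the project itself these rules are enforced by the Pydantic model in models/chat.py. As a stdlib-only illustration of the same defaults and bounds (the helper name is hypothetical; only the field names and defaults come from the list above):

```python
def validate_chat_request(body: dict) -> dict:
    """Apply the defaults and bounds described above to a raw request body."""
    if not body.get("question"):
        raise ValueError("'question' is required")
    validated = {
        "question": body["question"],
        "model": body.get("model", "llama3.2"),
        "max_tokens": int(body.get("max_tokens", 1024)),
        "temperature": float(body.get("temperature", 1.0)),
    }
    if not 0.0 <= validated["temperature"] <= 1.0:
        raise ValueError("'temperature' must be between 0.0 and 1.0")
    if validated["max_tokens"] < 1:
        raise ValueError("'max_tokens' must be positive")
    return validated
```

With Pydantic, the same constraints live declaratively on the model's fields, so invalid requests are rejected before the controller runs.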
Available Models (must be pulled first with ollama pull <model>):
- llama3.2 - Meta's Llama 3.2 1B (recommended, fast)
- llama3.2:3b - Meta's Llama 3.2 3B (better quality)
- mistral - Mistral 7B (good balance)
- phi3 - Microsoft's Phi-3 (3.8B)
- gemma2:2b - Google's Gemma 2 2B (fast)
- qwen2.5:3b - Alibaba's Qwen 2.5 3B
- See all models: https://ollama.com/library
Health check endpoint.
Response:
{
"status": "healthy"
}
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{
"question": "What is the capital of France?"
}'
With custom parameters:
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{
"question": "Explain quantum computing in simple terms",
"max_tokens": 2048,
"temperature": 0.7
}'
import asyncio
import json

import httpx

async def ask_question(question: str):
    url = "http://localhost:8000/chat"
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            url,
            json={"question": question},
            timeout=30.0,
        ) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data = json.loads(line[6:])
                    if data.get("type") == "content_block_delta":
                        print(data.get("content", ""), end="", flush=True)
                    elif data.get("type") == "message_stop":
                        print("\n")

# Usage
asyncio.run(ask_question("What is FastAPI?"))

const response = await fetch('http://localhost:8000/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    question: 'What is machine learning?'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const {done, value} = await reader.read();
  if (done) break;
  const text = decoder.decode(value);
  const lines = text.split('\n');
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      if (data.type === 'content_block_delta') {
        console.log(data.content);
      }
    }
  }
}

The API returns a streaming response using Server-Sent Events (SSE) format:
data: {"content": "The", "type": "content_block_delta"}
data: {"content": " capital", "type": "content_block_delta"}
data: {"content": " of", "type": "content_block_delta"}
data: {"content": " France", "type": "content_block_delta"}
data: {"content": " is", "type": "content_block_delta"}
data: {"content": " Paris", "type": "content_block_delta"}
data: {"content": ".", "type": "content_block_delta"}
data: {"type": "message_stop"}
Event Types:
- content_block_delta: Contains a chunk of the response text in the content field
- message_stop: Signals the end of the streaming response
- error: Contains error information if something went wrong
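A client can reassemble the full answer by concatenating the content of each content_block_delta event until message_stop arrives. A minimal stdlib sketch (the function name is illustrative):

```python
import json

def assemble_sse(lines):
    """Join content_block_delta chunks from SSE data lines until message_stop."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and other SSE framing
        event = json.loads(line[len("data: "):])
        if event.get("type") == "content_block_delta":
            parts.append(event.get("content", ""))
        elif event.get("type") == "message_stop":
            break
    return "".join(parts)
```

Applied to the example stream above, this yields "The capital of France is Paris." as a single string.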
llm_streaming_api/
├── controllers/ # API route controllers/routers
│ ├── __init__.py
│ ├── chat_controller.py # Chat endpoint routes
│ └── health_controller.py # Health check routes
├── models/ # Pydantic models
│ ├── __init__.py
│ └── chat.py # Chat request/response models
├── services/ # Business logic services
│ ├── __init__.py
│ └── llm_service.py # LLM interaction service (with async streaming)
├── main.py # Application entry point (with CORS)
├── config.py # Configuration management
├── requirements.txt # Python dependencies
├── .env.example # Example environment variables
├── .gitignore # Git ignore file
├── demo.html # Interactive HTML demo for testing
├── test_streaming.py # Python script to test streaming
├── QUICKSTART.md # Quick start guide
└── readme.md # This file
The project follows a clean architecture pattern:
- Controllers: Handle HTTP requests and responses, validate input, and call services
- Services: Contain business logic and interact with Ollama
  - Uses ThreadPoolExecutor for non-blocking streaming
  - asyncio.Queue bridges the sync Ollama client with async FastAPI
  - Enables true real-time streaming without blocking other requests
- Models: Define data structures and validation rules using Pydantic
- Config: Centralized configuration management with environment variables
- CORS Middleware: Allows browser-based clients (like demo.html) to access the API
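The queue-based bridge used in llm_service.py can be sketched as follows. This is a simplified stand-in, not the project's actual code: the sync generator mimics the blocking Ollama client, and a sentinel object marks the end of the stream.

```python
import asyncio

def blocking_token_stream():
    """Stand-in for the synchronous Ollama streaming client."""
    for token in ["Hello", ", ", "world"]:
        yield token

async def stream_tokens():
    """Bridge a sync generator into an async one via asyncio.Queue."""
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    sentinel = object()  # marks end-of-stream

    def producer():
        # Runs in the executor's worker thread; hands tokens to the
        # event loop thread-safely so the loop is never blocked.
        for token in blocking_token_stream():
            loop.call_soon_threadsafe(queue.put_nowait, token)
        loop.call_soon_threadsafe(queue.put_nowait, sentinel)

    loop.run_in_executor(None, producer)
    while True:
        item = await queue.get()
        if item is sentinel:
            break
        yield item
```

Because the blocking iteration happens in a worker thread, other requests keep being served while tokens trickle through the queue.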
Configuration is managed through environment variables in .env:
- OLLAMA_HOST: Ollama server URL (default: http://localhost:11434)
- DEFAULT_MODEL: Default Ollama model to use (default: mistral)
- HOST: API server host (default: 0.0.0.0)
- PORT: API server port (default: 8000)
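config.py can read these variables with os.getenv, falling back to the defaults above. A minimal sketch (the variable names and defaults mirror the list above; the exact shape of the real config.py may differ):

```python
import os

def load_config() -> dict:
    """Read settings from environment variables, falling back to defaults."""
    return {
        "ollama_host": os.getenv("OLLAMA_HOST", "http://localhost:11434"),
        "default_model": os.getenv("DEFAULT_MODEL", "mistral"),
        "host": os.getenv("HOST", "0.0.0.0"),
        "port": int(os.getenv("PORT", "8000")),  # ports are ints, env vars are strings
    }
```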
The API includes error handling for:
- Ollama connection errors
- Model not found errors
- Network errors
- Invalid request parameters
Errors are returned in the response with appropriate status codes and messages.
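Mid-stream errors arrive as an SSE error event with the same data: framing as the other events. A hedged sketch of the framing (the exact payload fields emitted by llm_service.py may differ):

```python
import json

def sse_error(message: str) -> str:
    """Format an error message as an SSE data line (hypothetical payload shape)."""
    return "data: " + json.dumps({"type": "error", "error": message}) + "\n\n"
```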
To contribute or modify:
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
- The API runs locally with no external API calls
- All data stays on your machine (completely private)
- Consider adding authentication if exposing the API publicly
- Implement rate limiting for production deployments
- Ollama runs on localhost by default (not exposed to internet)
MIT License - feel free to use this project for any purpose.
For issues or questions:
- Check the API documentation
- Open an issue on GitHub