fassahat/llm_streaming_api

LLM Streaming API

A FastAPI-based streaming API for interacting with local LLMs using Ollama. Completely free and runs locally on your machine!

Features

  • Streaming Responses: Real-time streaming responses using Server-Sent Events (SSE)
  • Simple API: Ask a question, get a streaming response
  • FastAPI Framework: Modern, fast, and well-documented API framework
  • Ollama Integration: Run powerful open-source LLMs locally (Llama, Mistral, Phi, etc.)
  • No API Keys Required: Completely free, private, and runs offline
  • Type Safety: Full Pydantic validation for requests and responses
  • Easy Configuration: Environment-based configuration

Prerequisites

  • Python 3.8+ with pip
  • Ollama (installation steps below)

Setup Ollama

  1. Install Ollama:

    Linux:

    curl -fsSL https://ollama.com/install.sh | sh

    macOS:

    brew install ollama

    Windows: Download from https://ollama.com/download

  2. Start Ollama (if not already running):

    ollama serve
  3. Pull a model (choose one or more):

    # Llama 3.2 1B (fastest, good for testing)
    ollama pull llama3.2:1b

    # Llama 3.2 3B (better quality; this is the default llama3.2 tag)
    ollama pull llama3.2

    # Mistral (7B - Good balance)
    ollama pull mistral

    # Phi-3 (3.8B - Microsoft's model)
    ollama pull phi3
  4. Verify Ollama is running:

    ollama list

Installation

  1. Clone the repository:

    git clone <your-repo-url>
    cd llm_streaming_api
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables (optional):

    cp .env.example .env

The default configuration works out of the box. Edit .env only if you need to change the Ollama host:

OLLAMA_HOST=http://localhost:11434

Running the Server

Start the FastAPI server:

python main.py

Or using uvicorn directly:

uvicorn main:app --reload --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000

API Documentation

Once the server is running, visit:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Testing the Streaming

We provide multiple ways to test and visualize the streaming functionality:

1. Interactive HTML Demo (Recommended)

Open demo.html in your browser for a beautiful UI with real-time streaming visualization:

xdg-open demo.html  # Linux
# or: open demo.html  # macOS
# or: start demo.html  # Windows

2. Python Test Script

Run the test script to see streaming in your terminal:

python test_streaming.py

3. cURL with Streaming

curl -N -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain Python in one sentence"}'

4. Quick Start Guide

See QUICKSTART.md for a step-by-step 5-minute setup guide.

API Endpoints

POST /chat

Streaming chat completion endpoint. Ask a question and receive a streaming response.

Request Body:

{
  "question": "What is the capital of France?",
  "model": "llama3.2",
  "max_tokens": 1024,
  "temperature": 1.0
}

Parameters:

  • question (required): The user's question to ask the LLM
  • model (optional): Ollama model to use (default: "llama3.2")
  • max_tokens (optional): Maximum tokens to generate (default: 1024)
  • temperature (optional): Sampling temperature 0.0-1.0 (default: 1.0)
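These parameters map naturally onto a Pydantic model; a minimal sketch assuming the field names and defaults listed above (the project's models/chat.py may differ):

```python
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    """Request body for POST /chat (field names assumed from the docs above)."""
    question: str
    model: str = "llama3.2"
    max_tokens: int = Field(default=1024, gt=0)
    temperature: float = Field(default=1.0, ge=0.0, le=1.0)
```

With a model like this, FastAPI rejects out-of-range values (e.g. temperature of 2.0) with a 422 before the request ever reaches Ollama.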

Available Models (must be pulled first with ollama pull <model>):

  • llama3.2 - Meta's Llama 3.2 3B (the default tag)
  • llama3.2:1b - Meta's Llama 3.2 1B (fastest, good for testing)
  • mistral - Mistral 7B (good balance)
  • phi3 - Microsoft's Phi-3 (3.8B)
  • gemma2:2b - Google's Gemma 2 2B (fast)
  • qwen2.5:3b - Alibaba's Qwen 2.5 3B
  • See all models: https://ollama.com/library

GET /health

Health check endpoint.

Response:

{
  "status": "healthy"
}

Usage Examples

Using cURL

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the capital of France?"
  }'

With custom parameters:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Explain quantum computing in simple terms",
    "max_tokens": 2048,
    "temperature": 0.7
  }'

Using Python

import httpx
import json

async def ask_question(question: str):
    url = "http://localhost:8000/chat"

    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            url,
            json={"question": question},
            timeout=30.0
        ) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data = json.loads(line[6:])
                    if data.get("type") == "content_block_delta":
                        print(data.get("content", ""), end="", flush=True)
                    elif data.get("type") == "message_stop":
                        print("\n")

# Usage
import asyncio
asyncio.run(ask_question("What is FastAPI?"))

Using JavaScript (fetch)

const response = await fetch('http://localhost:8000/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    question: 'What is machine learning?'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const {done, value} = await reader.read();
  if (done) break;

  // A network chunk may end mid-line, so buffer and only parse complete lines
  buffer += decoder.decode(value, {stream: true});
  const lines = buffer.split('\n');
  buffer = lines.pop();  // keep any trailing partial line for the next read

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      if (data.type === 'content_block_delta') {
        console.log(data.content);
      }
    }
  }
}

Response Format

The API returns a streaming response using Server-Sent Events (SSE) format:

data: {"content": "The", "type": "content_block_delta"}

data: {"content": " capital", "type": "content_block_delta"}

data: {"content": " of", "type": "content_block_delta"}

data: {"content": " France", "type": "content_block_delta"}

data: {"content": " is", "type": "content_block_delta"}

data: {"content": " Paris", "type": "content_block_delta"}

data: {"content": ".", "type": "content_block_delta"}

data: {"type": "message_stop"}

Event Types:

  • content_block_delta: Contains a chunk of the response text in the content field
  • message_stop: Signals the end of the streaming response
  • error: Contains error information if something went wrong
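All three event types can be produced by a small generator on the server side; a sketch of the framing, assuming this payload shape (not necessarily the project's actual implementation in llm_service.py):

```python
import json
from typing import Iterable, Iterator

def format_sse(payload: dict) -> str:
    """Frame one JSON payload as a Server-Sent Event."""
    return f"data: {json.dumps(payload)}\n\n"

def sse_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield a content_block_delta event per chunk, then message_stop,
    or an error event if the chunk source raises mid-stream."""
    try:
        for chunk in chunks:
            yield format_sse({"content": chunk, "type": "content_block_delta"})
        yield format_sse({"type": "message_stop"})
    except Exception as exc:
        yield format_sse({"type": "error", "error": str(exc)})
```

Emitting the error as a regular event means clients see a structured failure instead of an abruptly dropped connection.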

Project Structure

llm_streaming_api/
├── controllers/              # API route controllers/routers
│   ├── __init__.py
│   ├── chat_controller.py    # Chat endpoint routes
│   └── health_controller.py  # Health check routes
├── models/                   # Pydantic models
│   ├── __init__.py
│   └── chat.py              # Chat request/response models
├── services/                 # Business logic services
│   ├── __init__.py
│   └── llm_service.py       # LLM interaction service (with async streaming)
├── main.py                   # Application entry point (with CORS)
├── config.py                 # Configuration management
├── requirements.txt          # Python dependencies
├── .env.example             # Example environment variables
├── .gitignore               # Git ignore file
├── demo.html                # Interactive HTML demo for testing
├── test_streaming.py        # Python script to test streaming
├── QUICKSTART.md            # Quick start guide
└── readme.md                # This file

Architecture

The project follows a clean architecture pattern:

  • Controllers: Handle HTTP requests and responses, validate input, and call services
  • Services: Contain business logic and interact with Ollama
    • Uses ThreadPoolExecutor for non-blocking streaming
    • asyncio.Queue bridges sync Ollama client with async FastAPI
    • Enables true real-time streaming without blocking other requests
  • Models: Define data structures and validation rules using Pydantic
  • Config: Centralized configuration management with environment variables
  • CORS Middleware: Allows browser-based clients (like demo.html) to access the API
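The ThreadPoolExecutor / asyncio.Queue bridge described above can be sketched as follows. This is illustrative only: the blocking iterator stands in for the sync Ollama client, and the real code in llm_service.py will differ.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import AsyncIterator, Callable, Iterable

_executor = ThreadPoolExecutor()
_SENTINEL = object()  # marks end-of-stream on the queue

async def bridge_stream(make_iter: Callable[[], Iterable[str]]) -> AsyncIterator[str]:
    """Run a blocking chunk iterator on a worker thread and relay each
    chunk through an asyncio.Queue, so the event loop stays free to
    serve other requests while the model generates."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer() -> None:
        try:
            for chunk in make_iter():  # blocking iteration (e.g. sync Ollama client)
                loop.call_soon_threadsafe(queue.put_nowait, chunk)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _SENTINEL)

    loop.run_in_executor(_executor, producer)
    while True:
        chunk = await queue.get()
        if chunk is _SENTINEL:
            return
        yield chunk
```

The producer thread never touches the queue directly; `call_soon_threadsafe` hands each chunk to the event loop, which is what makes the sync-to-async handoff safe.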

Configuration

Configuration is managed through environment variables in .env:

  • OLLAMA_HOST: Ollama server URL (default: http://localhost:11434)
  • DEFAULT_MODEL: Default Ollama model to use (default: mistral)
  • HOST: API server host (default: 0.0.0.0)
  • PORT: API server port (default: 8000)
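Reading these variables amounts to a few `os.getenv` calls with the defaults above; a sketch (the project's actual config.py may be organized differently, e.g. as a settings class):

```python
import os

# Variable names and defaults taken from the list above.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "mistral")
HOST = os.getenv("HOST", "0.0.0.0")
PORT = int(os.getenv("PORT", "8000"))  # env vars are strings; coerce once here
```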

Error Handling

The API includes error handling for:

  • Ollama connection errors
  • Model not found errors
  • Network errors
  • Invalid request parameters

Errors are returned in the response with appropriate status codes and messages.

Development

To contribute or modify:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

Security Considerations

  • The API runs locally with no external API calls
  • All data stays on your machine (completely private)
  • Consider adding authentication if exposing the API publicly
  • Implement rate limiting for production deployments
  • Ollama runs on localhost by default (not exposed to internet)

License

MIT License - feel free to use this project for any purpose.

Support

For issues or questions, please open an issue on the repository.

Acknowledgments

  • Built with FastAPI
  • Powered by Ollama
  • Open-source LLMs: Llama, Mistral, Phi, and more
