fassahat/llm_streaming_api

LLM Streaming API

A FastAPI-based streaming API for interacting with local LLMs using Ollama. Completely free and runs locally on your machine!

Features

  • Streaming Responses: Real-time streaming responses using Server-Sent Events (SSE)
  • Simple API: Ask a question, get a streaming response
  • FastAPI Framework: Modern, fast, and well-documented API framework
  • Ollama Integration: Run powerful open-source LLMs locally (Llama, Mistral, Phi, etc.)
  • No API Keys Required: Completely free, private, and runs offline
  • Type Safety: Full Pydantic validation for requests and responses
  • Easy Configuration: Environment-based configuration

Prerequisites

  • Python 3.8+ with pip
  • Ollama (installation steps below)

Setup Ollama

  1. Install Ollama:

    Linux:

    curl -fsSL https://ollama.com/install.sh | sh

    macOS:

    brew install ollama

    Windows: Download from https://ollama.com/download

  2. Start Ollama (if not already running):

    ollama serve
  3. Pull a model (choose one or more):

    # Llama 3.2 1B (fastest, good for testing)
    ollama pull llama3.2:1b

    # Llama 3.2 3B (better quality; this is the default llama3.2 tag)
    ollama pull llama3.2

    # Mistral (7B - Good balance)
    ollama pull mistral

    # Phi-3 (3.8B - Microsoft's model)
    ollama pull phi3
  4. Verify Ollama is running:

    ollama list

Installation

  1. Clone the repository:

    git clone <your-repo-url>
    cd llm_streaming_api
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables (optional):

    cp .env.example .env

The default configuration works out of the box. Edit .env only if you need to change the Ollama host:

OLLAMA_HOST=http://localhost:11434

Running the Server

Start the FastAPI server:

python main.py

Or using uvicorn directly:

uvicorn main:app --reload --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000

API Documentation

Once the server is running, visit:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Testing the Streaming

We provide multiple ways to test and visualize the streaming functionality:

1. Interactive HTML Demo (Recommended)

Open demo.html in your browser for a beautiful UI with real-time streaming visualization:

xdg-open demo.html  # Linux
# or: open demo.html  # macOS
# or: start demo.html  # Windows

2. Python Test Script

Run the test script to see streaming in your terminal:

python test_streaming.py

3. cURL with Streaming

curl -N -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain Python in one sentence"}'

4. Quick Start Guide

See QUICKSTART.md for a step-by-step 5-minute setup guide.

API Endpoints

POST /chat

Streaming chat completion endpoint. Ask a question and receive a streaming response.

Request Body:

{
  "question": "What is the capital of France?",
  "model": "llama3.2",
  "max_tokens": 1024,
  "temperature": 1.0
}

Parameters:

  • question (required): The user's question to ask the LLM
  • model (optional): Ollama model to use (default: "llama3.2")
  • max_tokens (optional): Maximum tokens to generate (default: 1024)
  • temperature (optional): Sampling temperature 0.0-1.0 (default: 1.0)
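These parameters map naturally onto a Pydantic model; a minimal sketch assuming the field names and defaults listed above (the project's models/chat.py may differ):

```python
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    """Request body for POST /chat (field names assumed from the docs above)."""
    question: str
    model: str = "llama3.2"
    max_tokens: int = Field(default=1024, gt=0)
    temperature: float = Field(default=1.0, ge=0.0, le=1.0)
```

With a model like this, FastAPI rejects out-of-range values (e.g. temperature of 2.0) with a 422 before the request ever reaches Ollama.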

Available Models (must be pulled first with ollama pull <model>):

  • llama3.2 - Meta's Llama 3.2 3B (the default tag)
  • llama3.2:1b - Meta's Llama 3.2 1B (fastest, good for testing)
  • mistral - Mistral 7B (good balance)
  • phi3 - Microsoft's Phi-3 (3.8B)
  • gemma2:2b - Google's Gemma 2 2B (fast)
  • qwen2.5:3b - Alibaba's Qwen 2.5 3B
  • See all models: https://ollama.com/library

GET /health

Health check endpoint.

Response:

{
  "status": "healthy"
}

Usage Examples

Using cURL

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the capital of France?"
  }'

With custom parameters:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Explain quantum computing in simple terms",
    "max_tokens": 2048,
    "temperature": 0.7
  }'

Using Python

import httpx
import json

async def ask_question(question: str):
    url = "http://localhost:8000/chat"

    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            url,
            json={"question": question},
            timeout=30.0
        ) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data = json.loads(line[6:])
                    if data.get("type") == "content_block_delta":
                        print(data.get("content", ""), end="", flush=True)
                    elif data.get("type") == "message_stop":
                        print("\n")

# Usage
import asyncio
asyncio.run(ask_question("What is FastAPI?"))

Using JavaScript (fetch)

const response = await fetch('http://localhost:8000/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    question: 'What is machine learning?'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const {done, value} = await reader.read();
  if (done) break;

  // A network chunk may end mid-line, so buffer and only parse complete lines
  buffer += decoder.decode(value, {stream: true});
  const lines = buffer.split('\n');
  buffer = lines.pop();  // keep any trailing partial line for the next read

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      if (data.type === 'content_block_delta') {
        console.log(data.content);
      }
    }
  }
}

Response Format

The API returns a streaming response using Server-Sent Events (SSE) format:

data: {"content": "The", "type": "content_block_delta"}

data: {"content": " capital", "type": "content_block_delta"}

data: {"content": " of", "type": "content_block_delta"}

data: {"content": " France", "type": "content_block_delta"}

data: {"content": " is", "type": "content_block_delta"}

data: {"content": " Paris", "type": "content_block_delta"}

data: {"content": ".", "type": "content_block_delta"}

data: {"type": "message_stop"}

Event Types:

  • content_block_delta: Contains a chunk of the response text in the content field
  • message_stop: Signals the end of the streaming response
  • error: Contains error information if something went wrong
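All three event types can be produced by a small generator on the server side; a sketch of the framing, assuming this payload shape (not necessarily the project's actual implementation in llm_service.py):

```python
import json
from typing import Iterable, Iterator

def format_sse(payload: dict) -> str:
    """Frame one JSON payload as a Server-Sent Event."""
    return f"data: {json.dumps(payload)}\n\n"

def sse_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield a content_block_delta event per chunk, then message_stop,
    or an error event if the chunk source raises mid-stream."""
    try:
        for chunk in chunks:
            yield format_sse({"content": chunk, "type": "content_block_delta"})
        yield format_sse({"type": "message_stop"})
    except Exception as exc:
        yield format_sse({"type": "error", "error": str(exc)})
```

Emitting the error as a regular event means clients see a structured failure instead of an abruptly dropped connection.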

Project Structure

llm_streaming_api/
├── controllers/              # API route controllers/routers
│   ├── __init__.py
│   ├── chat_controller.py    # Chat endpoint routes
│   └── health_controller.py  # Health check routes
├── models/                   # Pydantic models
│   ├── __init__.py
│   └── chat.py              # Chat request/response models
├── services/                 # Business logic services
│   ├── __init__.py
│   └── llm_service.py       # LLM interaction service (with async streaming)
├── main.py                   # Application entry point (with CORS)
├── config.py                 # Configuration management
├── requirements.txt          # Python dependencies
├── .env.example             # Example environment variables
├── .gitignore               # Git ignore file
├── demo.html                # Interactive HTML demo for testing
├── test_streaming.py        # Python script to test streaming
├── QUICKSTART.md            # Quick start guide
└── readme.md                # This file

Architecture

The project follows a clean architecture pattern:

  • Controllers: Handle HTTP requests and responses, validate input, and call services
  • Services: Contain business logic and interact with Ollama
    • Uses ThreadPoolExecutor for non-blocking streaming
    • asyncio.Queue bridges sync Ollama client with async FastAPI
    • Enables true real-time streaming without blocking other requests
  • Models: Define data structures and validation rules using Pydantic
  • Config: Centralized configuration management with environment variables
  • CORS Middleware: Allows browser-based clients (like demo.html) to access the API
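The ThreadPoolExecutor / asyncio.Queue bridge described above can be sketched as follows. This is illustrative only: the blocking iterator stands in for the sync Ollama client, and the real code in llm_service.py will differ.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import AsyncIterator, Callable, Iterable

_executor = ThreadPoolExecutor()
_SENTINEL = object()  # marks end-of-stream on the queue

async def bridge_stream(make_iter: Callable[[], Iterable[str]]) -> AsyncIterator[str]:
    """Run a blocking chunk iterator on a worker thread and relay each
    chunk through an asyncio.Queue, so the event loop stays free to
    serve other requests while the model generates."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer() -> None:
        try:
            for chunk in make_iter():  # blocking iteration (e.g. sync Ollama client)
                loop.call_soon_threadsafe(queue.put_nowait, chunk)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _SENTINEL)

    loop.run_in_executor(_executor, producer)
    while True:
        chunk = await queue.get()
        if chunk is _SENTINEL:
            return
        yield chunk
```

The producer thread never touches the queue directly; `call_soon_threadsafe` hands each chunk to the event loop, which is what makes the sync-to-async handoff safe.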

Configuration

Configuration is managed through environment variables in .env:

  • OLLAMA_HOST: Ollama server URL (default: http://localhost:11434)
  • DEFAULT_MODEL: Default Ollama model to use (default: mistral)
  • HOST: API server host (default: 0.0.0.0)
  • PORT: API server port (default: 8000)
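Reading these variables amounts to a few `os.getenv` calls with the defaults above; a sketch (the project's actual config.py may be organized differently, e.g. as a settings class):

```python
import os

# Variable names and defaults taken from the list above.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "mistral")
HOST = os.getenv("HOST", "0.0.0.0")
PORT = int(os.getenv("PORT", "8000"))  # env vars are strings; coerce once here
```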

Error Handling

The API includes error handling for:

  • Ollama connection errors
  • Model not found errors
  • Network errors
  • Invalid request parameters

Errors are returned in the response with appropriate status codes and messages.

Development

To contribute or modify:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

Security Considerations

  • The API runs locally with no external API calls
  • All data stays on your machine (completely private)
  • Consider adding authentication if exposing the API publicly
  • Implement rate limiting for production deployments
  • Ollama runs on localhost by default (not exposed to internet)

License

MIT License - feel free to use this project for any purpose.

Support

For issues or questions, please open an issue on the repository.

Acknowledgments

  • Built with FastAPI
  • Powered by Ollama
  • Open-source LLMs: Llama, Mistral, Phi, and more
