Special thanks to 雷哥 for providing the initial implementation and foundational code for this project.
雷哥's openai_whisper_compatible_api.py and integration ideas with Spokenly have provided a clear template for building a local, switchable-backend speech-to-text service. This has inspired the support for multi-model and MLX backends.
We sincerely appreciate their open-source contribution and inspiration.
RAPL (Remote Audio Processing Layer) is an OpenAI-compatible speech-to-text (ASR) API service that runs locally. It supports multiple backend models and can work with front-end speech input apps like Spokenly—no need to upload audio to the cloud.
- Python 3.10+
- Recommended: use a virtual environment
# After cloning or extracting the project, change to the directory
cd local-asr-api # folder name
# Create virtual environment (optional)
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

If you want to use the MLX backend (recommended on macOS Apple Silicon), you'll need `torch` and `torchaudio` installed. If you see `No module named 'torch'`, run:
pip install torch torchaudio

You can select the backend and model via environment variables (or set the defaults at the top of `openai_whisper_compatible_api.py`):
| Variable | Description | Example |
|---|---|---|
| `BACKEND` | Backend engine | `sensevoice` or `mlx` |
| `LOCAL_MODEL` | Model ID (Hugging Face or ModelScope) | See table below |

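
The environment-variable selection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `AsrConfig`, `load_config`, and `DEFAULT_MODELS` are hypothetical names, and pairing each backend with a default model ID is an assumption (the README only states that SenseVoice is the default backend).

```python
import os
from dataclasses import dataclass

# Assumption: each backend falls back to the model listed for it in this README.
DEFAULT_BACKEND = "sensevoice"
DEFAULT_MODELS = {
    "sensevoice": "iic/SenseVoiceSmall",
    "mlx": "mlx-community/Qwen3-ASR-1.7B-8bit",
}


@dataclass(frozen=True)
class AsrConfig:
    backend: str
    model: str


def load_config(env=None) -> AsrConfig:
    """Read BACKEND and LOCAL_MODEL from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    backend = env.get("BACKEND", DEFAULT_BACKEND).strip().lower()
    if backend not in DEFAULT_MODELS:
        raise ValueError(f"Unsupported BACKEND: {backend!r}")
    # An empty or missing LOCAL_MODEL falls back to the backend's default.
    model = env.get("LOCAL_MODEL") or DEFAULT_MODELS[backend]
    return AsrConfig(backend=backend, model=model)
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, for example `load_config({"BACKEND": "mlx"})`.
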
Currently Supported Models:
| Backend | Model ID | Description | Cache Dir |
|---|---|---|---|
| SenseVoice | `iic/SenseVoiceSmall` | Multilingual, emotion/event detection (FunASR / ModelScope) | `~/.cache/modelscope/hub/` |
| MLX | `mlx-community/Qwen3-ASR-1.7B-8bit` | Faster inference on Mac (Hugging Face) | `~/.cache/huggingface/hub/` |

Models will be downloaded automatically to the above cache directories the first time you run them, then loaded from cache afterward.
# Using SenseVoice (default)
python openai_whisper_compatible_api.py
# Using MLX (e.g. on Mac)
export BACKEND=mlx
export LOCAL_MODEL=mlx-community/Qwen3-ASR-1.7B-8bit
python openai_whisper_compatible_api.py

By default, the service listens on `http://127.0.0.1:8000`. A progress bar in the terminal shows transcription progress.
If new MLX-format ASR models appear on Hugging Face, just change the environment variables and restart—no code modification needed:
export BACKEND=mlx
export LOCAL_MODEL=mlx-community/your-new-model
python openai_whisper_compatible_api.py

RAPL implements an OpenAI Whisper API-compatible interface, so you can use it with any speech app that supports "OpenAI Compatible API" mode, such as Spokenly.
- Open Spokenly’s Dictation Models settings.
- Select “OpenAI Compatible API” or “</> API” type.
- Fill out the settings:
  - URL: `http://127.0.0.1:8000` (must match RAPL’s address)
  - Model: the same model RAPL is running, e.g. `mlx-community/Qwen3-ASR-1.7B-8bit` or `iic/SenseVoiceSmall`
  - API Key: not required for local use; enter any value (e.g. `anything`)
- Click Test & Save.
- Spokenly records speech and sends it to RAPL on your local machine.
- RAPL transcribes using a local model—your audio never leaves your device, making it privacy-friendly.
- The transcription result is returned to Spokenly, in OpenAI-compatible format, for dictation, subtitles, etc.
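
For clients other than Spokenly, the same request can be sketched with the Python standard library alone. This is a minimal sketch under assumptions: `build_multipart` and `transcribe` are hypothetical helper names, the `file` and `model` form fields follow the OpenAI transcription API, and the server is assumed to ignore the bearer token as described above.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        f"\r\n\r\n".encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, model, base_url="http://127.0.0.1:8000", api_key="anything"):
    """POST an audio file to /v1/audio/transcriptions and return the text."""
    audio = Path(path)
    body, content_type = build_multipart(
        {"model": model}, "file", audio.name, audio.read_bytes()
    )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type, "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

With the service running, `transcribe("sample.wav", "iic/SenseVoiceSmall")` would return the transcribed string; `sample.wav` is a placeholder file name.
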
If Spokenly or your system requires HTTPS, you can set up a local reverse proxy or TLS termination. By default, RAPL is HTTP only.
RAPL automatically preprocesses uploaded audio before inference using ffmpeg. All incoming audio is converted to 16kHz mono 16-bit WAV — the optimal format for ASR models. This happens transparently; callers do not need to change anything.
ASR models only need 16kHz mono audio. Most input files are higher quality than necessary (44.1kHz stereo, lossless formats, etc.). Downsampling before inference reduces memory usage and improves processing speed.
- ffmpeg must be installed on the system (`brew install ffmpeg` on macOS, `apt install ffmpeg` on Linux).
- If ffmpeg is not available, RAPL falls back to the original uploaded file, so nothing breaks.
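
The conversion and fallback behavior can be sketched as below. This is an illustrative sketch, not RAPL's actual implementation: `ffmpeg_command` and `preprocess` are hypothetical names, the output file naming is an assumption, and falling back when the conversion itself fails (not only when ffmpeg is missing) is also an assumption.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_command(src, dst, ffmpeg="ffmpeg"):
    """Build an ffmpeg call that writes 16 kHz mono 16-bit PCM WAV."""
    return [
        ffmpeg, "-y",          # overwrite the output file if it exists
        "-i", str(src),        # input file in any format ffmpeg can decode
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(dst),
    ]


def preprocess(src, which=shutil.which):
    """Return the path of the converted file, or the original on fallback."""
    src = Path(src)
    ffmpeg = which("ffmpeg")
    if not ffmpeg:
        return src  # ffmpeg not installed: use the upload as-is
    dst = src.with_name(src.stem + ".16k.wav")  # assumed naming scheme
    try:
        subprocess.run(ffmpeg_command(src, dst, ffmpeg), check=True, capture_output=True)
    except (OSError, subprocess.CalledProcessError):
        return src  # assumption: a failed conversion also falls back
    return dst
```
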
| Metric | Without Preprocessing | With Preprocessing |
|---|---|---|
| Raw waveform in RAM (30s, 44.1kHz stereo) | ~5.3 MB | ~960 KB (~5.5x reduction) |
| Raw waveform in RAM (30s, 16kHz mono) | ~960 KB | ~960 KB (no change) |
| Transcription speed (short utterances, 5-15s) | baseline | ~5-10% faster |
| Transcription speed (longer audio, 60s+) | baseline | ~10-20% faster |
Note: The neural network forward pass (the main bottleneck) operates on fixed-rate feature frames regardless of input sample rate. The speed gain comes from faster file loading and feature extraction. The bigger win is memory reduction, which improves stability on memory-constrained machines or with long recordings.
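
The memory figures in the table follow from the size of uncompressed PCM audio: duration × sample rate × channels × bytes per sample. The short check below assumes 16-bit (2-byte) samples, which is what the table's numbers are consistent with; `raw_pcm_bytes` is a hypothetical helper name.

```python
def raw_pcm_bytes(seconds, sample_rate, channels, bytes_per_sample=2):
    """Size of an uncompressed PCM waveform in bytes."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


before = raw_pcm_bytes(30, 44_100, channels=2)  # 30 s, 44.1 kHz stereo
after = raw_pcm_bytes(30, 16_000, channels=1)   # 30 s, 16 kHz mono

print(before, after, round(before / after, 1))  # 5292000 960000 5.5
```
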
Below is the overall architecture of RAPL working with Spokenly (see Mermaid diagram in supported Markdown previewers):
flowchart TB
subgraph Client["Client"]
Spokenly["Spokenly Speech Input App"]
end
subgraph RAPL["RAPL Local Service (http://127.0.0.1:8000)"]
direction TB
API["FastAPI Service"]
API --> Router["Router Layer"]
Router --> Transcribe["POST /v1/audio/transcriptions"]
Transcribe --> Preprocess["ffmpeg: 16kHz mono WAV"]
Preprocess --> Progress["Progress Bar (tqdm)"]
Progress --> Backend["Backend Selector"]
Backend --> SenseVoice["SenseVoice Backend"]
Backend --> MLX["MLX Backend"]
SenseVoice --> ModelScope["ModelScope / FunASR"]
MLX --> HuggingFace["Hugging Face / mlx-audio"]
ModelScope --> Cache1["~/.cache/modelscope"]
HuggingFace --> Cache2["~/.cache/huggingface/hub"]
end
Spokenly -->|"Upload audio (multipart)"| API
API -->|"JSON: { text }"| Spokenly
Simplified Data Flow:
sequenceDiagram
participant S as Spokenly
participant R as RAPL (FastAPI)
participant M as Local Model (SenseVoice / MLX)
S->>R: POST /v1/audio/transcriptions (audio file)
R->>R: Save temp file → Preprocess (ffmpeg: 16kHz mono WAV)
R->>R: Compute duration → Show progress bar
R->>M: Call current backend for transcription
M->>R: Return text
R->>R: Close progress bar
R->>S: 200 OK { "text": "transcribed text" }
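
The data flow above can be approximated in plain Python as a request handler with pluggable backends. This is a simplified sketch under assumptions: `handle_upload` and the `backends` mapping are hypothetical, the progress bar and FastAPI routing are omitted, and the real service may organize these steps differently.

```python
import os
import tempfile
from typing import Callable, Dict

Transcriber = Callable[[str], str]  # takes an audio file path, returns text


def handle_upload(
    audio_bytes: bytes,
    filename: str,
    backends: Dict[str, Transcriber],
    backend_name: str,
    preprocess: Callable[[str], str] = lambda path: path,
) -> dict:
    """Save the upload, preprocess it, transcribe it, and build the response."""
    if backend_name not in backends:
        raise KeyError(f"Unknown backend: {backend_name!r}")
    suffix = os.path.splitext(filename)[1] or ".bin"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(audio_bytes)
        tmp_path = tmp.name
    prepared = tmp_path
    try:
        prepared = str(preprocess(tmp_path))     # e.g. ffmpeg to 16 kHz mono WAV
        text = backends[backend_name](prepared)  # call the selected backend
    finally:
        # Remove the temp upload and any converted copy.
        for path in {tmp_path, prepared}:
            if os.path.exists(path):
                os.remove(path)
    return {"text": text}
```

Keeping the backends in a mapping mirrors the "Backend Selector" box in the architecture diagram: adding a backend means registering one more callable rather than changing the handler.
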
- API Script: `openai_whisper_compatible_api.py`
- Dependencies: `requirements.txt`
- Changelog/Design Notes: See `CHANGELOG.md`
- SenseVoice original info and citation: See ModelScope / FunASR