Use this guide if you are setting up speech features on:
- NVIDIA GPU systems
- Apple Silicon systems
This guide supports:
- make-driven local setup
- manual/local Python setup
- Docker + WebUI setup
Important: the stock Docker quickstart is not a turnkey GPU-enabled audio profile. If you want the fastest first successful accelerated setup, local Python or make is the better path today.
| Hardware | Recommended STT | Fallback STT | Recommended TTS | Why |
|---|---|---|---|---|
| NVIDIA | faster-whisper | parakeet-tdt-0.6b-v3-onnx | supertonic | best first-run accelerated STT path in current repo, with a simpler local TTS path |
| Apple Silicon | parakeet-mlx | parakeet-tdt-0.6b-v3-onnx | supertonic | makes MLX the primary speech acceleration path while keeping TTS local-first |
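The table can also be read as a small lookup keyed by hardware class. The following is a minimal sketch, not part of the repo: the helper names `detect_hardware` and `recommended_engines` are hypothetical, the detection rules (macOS on `arm64`, or `nvidia-smi` present on `PATH`) are assumptions, and the no-accelerator branch simply reuses the CPU-friendly STT fallback from the table.

```python
import platform
import shutil

# Rows copied from the table above; the dictionary keys are labels chosen for this sketch.
RECOMMENDED = {
    "nvidia": {
        "stt": "faster-whisper",
        "fallback_stt": "parakeet-tdt-0.6b-v3-onnx",
        "tts": "supertonic",
    },
    "apple_silicon": {
        "stt": "parakeet-mlx",
        "fallback_stt": "parakeet-tdt-0.6b-v3-onnx",
        "tts": "supertonic",
    },
}


def detect_hardware(system=None, machine=None, has_nvidia_smi=None):
    """Return 'apple_silicon', 'nvidia', or None (hypothetical helper)."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if has_nvidia_smi is None:
        # Assumption: finding nvidia-smi on PATH is a good enough first hint.
        has_nvidia_smi = shutil.which("nvidia-smi") is not None
    if system == "Darwin" and machine == "arm64":
        return "apple_silicon"
    if has_nvidia_smi:
        return "nvidia"
    return None


def recommended_engines(hardware):
    """Look up the table row; without an accelerator, use the fallback STT engine."""
    if hardware in RECOMMENDED:
        return RECOMMENDED[hardware]
    return {"stt": "parakeet-tdt-0.6b-v3-onnx", "fallback_stt": None, "tts": "supertonic"}
```

Calling `recommended_engines(detect_hardware())` returns the row for the current machine. Finding `nvidia-smi` on `PATH` does not prove the driver works, so running the command itself is still worthwhile.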
Alternatives:
- If you need local voice cloning on day one: `pocket_tts`
- If you want a better but more demanding TTS stack after the basics work: `qwen3_tts`
Important current-repo realities:
- The shipped `config.txt` defaults use `parakeet-tdt-0.6b-v3-onnx` for STT (the CPU-friendly default). The shorter `parakeet-onnx` alias remains supported for older configs. This guide shows you how to change those defaults to GPU-optimized engines.
- The `/setup` bundle docs may recommend a different first-run STT path for some hardware classes.
- Stock Docker CPU/default audio works with bundled dependencies, but the stock Docker profile is not a ready-made GPU-accelerated audio path. Host-side config or model edits require a rebuild, `Dockerfiles/docker-compose.host-storage.yml`, or a custom image path.
Use the NVIDIA path if:
- `nvidia-smi` works on the host
- you want accelerated faster-whisper first
Use the Apple Silicon path if:
- you are on an M-series Mac
- you want MLX-based Parakeet as the main STT path
- Git
- Python 3.10+ for local/manual or `make`
- `ffmpeg`
- `git-lfs` if you want the recommended `supertonic` path
- current NVIDIA drivers
- a working `nvidia-smi`
- CUDA-capable runtime for your chosen environment
Check this first:
nvidia-smi
- Apple Silicon Mac
- Python 3.10+
- ability to install MLX packages in the active environment
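The prerequisite lists above can be checked in one pass before you install anything. This is a minimal sketch under stated assumptions: `missing_prerequisites` is a hypothetical helper, it assumes each tool is reachable on `PATH` under the names `git`, `ffmpeg`, `git-lfs`, and `nvidia-smi`, and it only checks presence, not that the tool actually works.

```python
import shutil
import sys

# Tool names taken from the prerequisite lists above.
REQUIRED_TOOLS = ("git", "ffmpeg", "git-lfs")


def missing_prerequisites(python_version=None, which=shutil.which, need_nvidia=False):
    """Return a list of readable problems; an empty list means every check passed."""
    if python_version is None:
        version = tuple(sys.version_info[:2])
    else:
        version = tuple(python_version)
    problems = []
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    tools = REQUIRED_TOOLS + (("nvidia-smi",) if need_nvidia else ())
    for tool in tools:
        # `which` is injectable so the check can be exercised without touching PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

On an NVIDIA host, `missing_prerequisites(need_nvidia=True)` reports anything still missing. The MLX package requirement for Apple Silicon is not covered here, because it depends on the active Python environment.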
Linux:
sudo apt-get update
sudo apt-get install -y ffmpeg git git-lfs python3 python3-venv
git lfs install
macOS:
brew install ffmpeg git git-lfs python@3.12
git lfs install
Windows:
- install Python 3.10+
- install FFmpeg
- install Git and Git LFS
- for NVIDIA, confirm `nvidia-smi` works in PowerShell
Then:
git lfs install
If your server is already running, skip to Step 2.
git clone https://github.com/rmusser01/tldw_server.git
cd tldw_server
make install-local
make setup-local-single
make start-local-single
Linux/macOS:
git clone https://github.com/rmusser01/tldw_server.git
cd tldw_server
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
python -m uvicorn tldw_Server_API.app.main:app --reload
Windows PowerShell:
git clone https://github.com/rmusser01/tldw_server.git
cd tldw_server
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .
python -m uvicorn tldw_Server_API.app.main:app --reload
Docker + WebUI:
git clone https://github.com/rmusser01/tldw_server.git
cd tldw_server
cp tldw_Server_API/Config_Files/.env.example tldw_Server_API/Config_Files/.env
Set AUTH_MODE=single_user and SINGLE_USER_API_KEY=..., then:
docker compose --env-file tldw_Server_API/Config_Files/.env \
  -f Dockerfiles/docker-compose.single-user.yml \
  -f Dockerfiles/docker-compose.webui.yml \
  up -d --build
Important Docker note:
- stock Docker CPU/default audio works with bundled dependencies
- the default compose profile is not a ready-made accelerated audio profile
- the app service does not declare GPU runtime reservations in the stock compose file
- host-side `Config_Files` and `models/` changes require a rebuild, `Dockerfiles/docker-compose.host-storage.yml`, or a custom image path
For accelerated audio, local/manual or make is the recommended first path.
Edit config.txt:
[STT-Settings]
default_batch_transcription_model = whisper-1
default_streaming_transcription_model = whisper-1
default_transcriber = faster-whisper
Notes:
- `whisper-1` is the simplest OpenAI-compatible starting point and maps to the faster-whisper Whisper path.
- If your GPU is smaller and `whisper-1` is too heavy, switch both defaults to a smaller faster-whisper model such as `medium`.
- If accelerated Whisper setup becomes unstable, fall back to `parakeet-tdt-0.6b-v3-onnx`.
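If you switch between STT models while testing, it can help to script the edit instead of changing three keys by hand. The sketch below assumes `config.txt` is plain INI that Python's `configparser` can read; that may not hold for every line of the shipped file. `set_stt_defaults` is a hypothetical helper, and `configparser` drops comments when it writes, so keep a backup of the original file.

```python
import configparser
import io


def set_stt_defaults(config_text, model, transcriber, variant=None):
    """Return config text with the [STT-Settings] defaults replaced."""
    # interpolation=None keeps any literal '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("STT-Settings"):
        parser.add_section("STT-Settings")
    section = parser["STT-Settings"]
    # Batch and streaming defaults are kept in sync, as the notes above suggest.
    section["default_batch_transcription_model"] = model
    section["default_streaming_transcription_model"] = model
    section["default_transcriber"] = transcriber
    if variant is not None:
        section["nemo_model_variant"] = variant
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

For example, `set_stt_defaults(text, "medium", "faster-whisper")` applies the smaller-model suggestion from the notes to both defaults at once.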
Install the MLX STT extras in your active environment:
pip install -e '.[STT_Parakeet_MLX]'
Then edit config.txt:
[STT-Settings]
default_batch_transcription_model = parakeet-mlx
default_streaming_transcription_model = parakeet-mlx
default_transcriber = parakeet
nemo_model_variant = mlx
If your accelerated path is not stable yet, use:
[STT-Settings]
default_batch_transcription_model = parakeet-tdt-0.6b-v3-onnx
default_streaming_transcription_model = parakeet-tdt-0.6b-v3-onnx
default_transcriber = parakeet
nemo_model_variant = onnx
If you are on Docker and you edited the host config, rebuild the app image.
The accelerated guide still recommends supertonic as the first local TTS path because it stays much simpler than the heavier TTS stacks.
python Helper_Scripts/TTS_Installers/install_tts_supertonic.py
Edit tts_providers_config.yaml:
providers:
  supertonic:
    enabled: true
    model_path: "models/supertonic/onnx"
    sample_rate: 24000
    device: "cpu"
    extra_params:
      voice_styles_dir: "models/supertonic/voice_styles"
      default_voice: "supertonic_m1"
      voice_files:
        supertonic_m1: "M1.json"
        supertonic_f1: "F1.json"
      default_total_step: 5
      default_speed: 1.05
      n_test: 1
Edit config.txt:
[TTS-Settings]
default_provider = supertonic
default_voice = supertonic_m1
Restart the server after changes.
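A common slip in this step is naming a default voice that has no entry in `voice_files`. The sketch below checks for that on an already-parsed provider block, passed in as a plain dictionary. `check_supertonic_config` is a hypothetical helper, and since the exact nesting of the YAML is an assumption here, it looks for the voice keys both at the top level of the block and under `extra_params`.

```python
def check_supertonic_config(provider_cfg, default_voice):
    """Return a list of problems found in a parsed supertonic provider block."""
    problems = []
    if not provider_cfg.get("enabled"):
        problems.append("supertonic is not enabled")
    # Assumption: voice settings may sit at the top level or under extra_params.
    settings = {**provider_cfg, **(provider_cfg.get("extra_params") or {})}
    voices = settings.get("voice_files") or {}
    candidates = (
        ("provider default_voice", settings.get("default_voice")),
        ("[TTS-Settings] default_voice", default_voice),
    )
    for label, name in candidates:
        if name not in voices:
            problems.append(f"{label} {name!r} has no entry in voice_files")
    return problems
```

Pass in the `supertonic` block from `tts_providers_config.yaml` and the `default_voice` value from `config.txt`. An empty list means the two files agree on a voice that actually has a style file mapped to it.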
Verify the accelerated lane you intended, then verify real TTS and STT.
Choose one reusable auth header before running the commands.
Single-user auth mode:
AUTH_HEADER=(-H "X-API-KEY: $SINGLE_USER_API_KEY")
Multi-user auth mode:
JWT=$(
curl -sS -X POST http://127.0.0.1:8000/api/v1/auth/login \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "username=$ADMIN_USERNAME" \
-d "password=$ADMIN_PASSWORD" | jq -r '.access_token'
)
AUTH_HEADER=(-H "Authorization: Bearer $JWT")
curl -sS http://127.0.0.1:8000/api/v1/audio/health \
  "${AUTH_HEADER[@]}"
curl -sS http://127.0.0.1:8000/api/v1/audio/voices/catalog \
  "${AUTH_HEADER[@]}" | jq '.supertonic'
curl -sS -X POST http://127.0.0.1:8000/api/v1/audio/speech \
"${AUTH_HEADER[@]}" \
-H "Content-Type: application/json" \
-d '{
"model": "tts-supertonic-1",
"voice": "supertonic_m1",
"input": "This is the accelerated audio setup smoke test.",
"response_format": "wav",
"stream": false
}' \
  --output accelerated_audio_smoke.wav
Host check:
nvidia-smi
STT readiness:
curl -sS "http://127.0.0.1:8000/api/v1/audio/transcriptions/health?model=whisper-1&warm=true" \
  "${AUTH_HEADER[@]}"
You want to see Whisper reported as usable and warm initialization succeeding.
STT readiness:
curl -sS "http://127.0.0.1:8000/api/v1/audio/transcriptions/health?model=parakeet-mlx" \
  "${AUTH_HEADER[@]}"
You want to see:
- `"provider": "parakeet"`
- `"alias": "parakeet-mlx"`
- `"usable": true` or `"available": true`
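The readiness fields listed above can be checked mechanically when you script the smoke test. This is a minimal sketch: `stt_ready` is a hypothetical helper, and it assumes the fields appear at the top level of the JSON body, which the real response shape may not match exactly.

```python
import json


def stt_ready(health_body, expected_provider=None, expected_alias=None):
    """Return True when a health response body reports the intended backend as ready."""
    data = json.loads(health_body)
    # Fail fast if a different backend than the intended one answered.
    if expected_provider is not None and data.get("provider") != expected_provider:
        return False
    if expected_alias is not None and data.get("alias") != expected_alias:
        return False
    # Either flag counts as ready, matching the list above.
    return data.get("usable") is True or data.get("available") is True
```

Feed it the body returned by the `curl` health call, for example `stt_ready(body, "parakeet", "parakeet-mlx")` for the MLX path.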
curl -sS -X POST http://127.0.0.1:8000/api/v1/audio/transcriptions \
  "${AUTH_HEADER[@]}" \
  -F "file=@accelerated_audio_smoke.wav" \
  -F "model=whisper-1"
curl -sS -X POST http://127.0.0.1:8000/api/v1/audio/transcriptions \
  "${AUTH_HEADER[@]}" \
  -F "file=@accelerated_audio_smoke.wav" \
  -F "model=parakeet-mlx"
Success means:
- the request completes
- the `text` field is close to `This is the accelerated audio setup smoke test`
- the backend matches the path you intended
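"Close to" can be made concrete with a fuzzy comparison, since STT output often differs in casing, punctuation, or word splits. The sketch below uses the standard library's `difflib`; the helper names and the `0.85` threshold are assumptions chosen for illustration, not values taken from the repo.

```python
import difflib
import re

EXPECTED = "This is the accelerated audio setup smoke test."


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", "", text.lower()).split())


def transcript_matches(text, expected=EXPECTED, threshold=0.85):
    """Return True when the normalized transcript is similar enough to the input."""
    matcher = difflib.SequenceMatcher(None, normalize(text), normalize(expected))
    return matcher.ratio() >= threshold
```

Pass the `text` field from the transcription response to `transcript_matches`. A stricter check could compare word by word, but character-level similarity is usually enough for a single smoke-test sentence.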
Use a PocketTTS runtime instead of supertonic if local voice cloning matters more than the simplest first-run TTS path.
Use:
- PocketTTS Voice Cloning Guide for `pocket_tts` (Python/ONNX)
- `python Helper_Scripts/TTS_Installers/install_tts_pocket_tts_cpp.py` for `pocket_tts_cpp` (compiled native runtime)
Tradeoffs:
- `pocket_tts` is the ONNX/Python runtime and is the simplest PocketTTS path to read and debug.
- `pocket_tts_cpp` is a separate compiled runtime and uses a different installer and runtime layout.
- Both are excellent if voice cloning is the point.
- Both are worse than the default first-sound path because you still need either a direct `voice_reference` clip or a stored `custom:<voice_id>` voice.
- `pocket_tts_cpp` streaming is only available when the local CLI probe proves incremental on this install; otherwise streaming requests fail closed.
After the basic accelerated stack works, move to `qwen3_tts`, the more demanding TTS stack listed under Alternatives.
Treat it as the advanced upgrade path, not the baseline.
- verify `nvidia-smi` on the host first
- keep `whisper-1` only if your card can handle it; otherwise switch to `medium`
- if the accelerated Whisper path is still unstable, switch to `parakeet-tdt-0.6b-v3-onnx` and get speech working first
- confirm you installed: `pip install -e '.[STT_Parakeet_MLX]'`
- verify the config really says `parakeet-mlx`
- if MLX still does not initialize, fall back to `parakeet-tdt-0.6b-v3-onnx`
- make the defaults explicit in config.txt
- do not rely on implicit provider selection if you care which backend is used
- verify with `/api/v1/audio/transcriptions/health?model=...`
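To see which STT defaults are still implicit, you can list the keys that are missing or empty in `[STT-Settings]`. As a sketch only: `implicit_stt_defaults` is a hypothetical helper, and it assumes `config.txt` parses as plain INI with Python's `configparser`.

```python
import configparser

# The three STT default keys used throughout this guide.
STT_KEYS = (
    "default_batch_transcription_model",
    "default_streaming_transcription_model",
    "default_transcriber",
)


def implicit_stt_defaults(config_text):
    """Return the STT default keys that are missing or empty in [STT-Settings]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("STT-Settings"):
        return list(STT_KEYS)
    section = parser["STT-Settings"]
    return [key for key in STT_KEYS if not section.get(key, "").strip()]
```

An empty result means all three defaults are set explicitly. Any key it returns is one where the server would fall back to implicit provider selection.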
- the stock app compose profile is not a GPU-optimized audio compose file
- host-side config changes require an image rebuild
- host-side model assets are not automatically mounted into the app container
If you want the least frustrating accelerated first run today, prefer local/manual or make.
The /setup bundle docs may recommend a different STT path than this guide. That can happen today.
Use /setup when you want guided provisioning, then manually set:
- your STT defaults in config.txt
- your TTS provider in config.txt
- your enabled provider block in tts_providers_config.yaml