wisper

Using Local Models with Wisper

Wisper can run entirely locally. Both the transcription (speech-to-text) and LLM formatting steps can run locally and independently, and you can mix and match: for example, use a local transcription server with a cloud LLM, or vice versa. For good fully local performance, choose models that can fit in memory at the same time and, ideally, run on an Nvidia GPU.


Local Transcription — speaches

speaches is the recommended self-hosted Whisper server. It exposes an OpenAI-compatible /v1/audio/transcriptions speech-to-text endpoint and supports GPU acceleration via faster-whisper. Speaches also supports text-to-speech models, but Wisper does not use these and they need not be configured.

Speaches Setup

Using Docker (simplest approach)

You’ll need Docker Engine with the Compose plugin. If you don’t already have them, the quickest way on any Linux distro is curl -fsSL https://get.docker.com | sh, or follow the distro-specific instructions (either route installs both the Engine and the Compose plugin).

Start speaches once to pull the image (pick the line matching your hardware):

# Nvidia GPU with CDI support (most Nvidia users on recent systems):
docker compose -f https://github.com/speaches-ai/speaches.git#master:compose.cuda-cdi.yaml up --detach

# Nvidia GPU without CDI:
docker compose -f https://github.com/speaches-ai/speaches.git#master:compose.cuda.yaml up --detach

# CPU only (other GPUs are not supported)
docker compose -f https://github.com/speaches-ai/speaches.git#master:compose.cpu.yaml up --detach

Without Docker: speaches can also be run from source using uv — see the speaches installation docs:

git clone https://github.com/speaches-ai/speaches.git
cd speaches
uv python install
uv venv && source .venv/bin/activate
uv sync
uvicorn --factory --host 0.0.0.0 speaches.main:create_app

Install uv via curl -LsSf https://astral.sh/uv/install.sh | sh if you don’t have it.

Download a Whisper speech-to-text model for Speaches

Pick a model from the table below. A good multilingual default is Systran/faster-whisper-large-v3 if you have the GPU and memory.

Then download (replacing the model name as needed) using either of these two approaches:

# the curl API approach:
curl -X POST http://localhost:8000/v1/models/Systran/faster-whisper-large-v3

# the speaches-cli approach (if you have `uv` installed):
SPEACHES_BASE_URL="http://localhost:8000" uvx speaches-cli model download Systran/faster-whisper-large-v3

The download may take a few minutes. After that, Wisper can start speaches automatically via the Start Command (see below) — you won’t need to run speaches manually again.
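Once a model is downloaded, you can sanity-check the server with a manual transcription request. A sketch, assuming the default port 8000 and an audio file named sample.wav in the current directory; on success the response is JSON with a "text" field containing the transcript:

```shell
# Send an audio file to the OpenAI-compatible transcription endpoint.
# sample.wav is a placeholder - substitute any short recording.
result=$(curl -sf --max-time 30 http://localhost:8000/v1/audio/transcriptions \
    -F "file=@sample.wav" \
    -F "model=Systran/faster-whisper-large-v3") \
  || result="request failed - is speaches running and the model downloaded?"
echo "$result"
```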

Wisper settings (Transcription tab)

Provider: Custom
API URL: http://localhost:8000/v1/audio/transcriptions
Model name: the exact model name as downloaded (e.g. Systran/faster-whisper-large-v3)
Start Command (optional): e.g. docker compose -f https://github.com/speaches-ai/speaches.git#master:compose.cuda-cdi.yaml up --detach

Replace compose.cuda-cdi.yaml with whichever variant you need (see setup above). Docker Compose caches the repo locally after the first run, so this is fast and works offline after that. Wisper runs the command automatically if speaches isn’t up and responding when you try to record.

Model                                   Size     Language      Notes
Systran/faster-whisper-large-v3         ~3 GB    Multilingual  Best accuracy; recommended default
Systran/faster-whisper-medium           ~1.5 GB  Multilingual  Good balance of speed and accuracy
Systran/faster-whisper-small            ~470 MB  Multilingual  Fast; lower accuracy
Systran/faster-distil-whisper-large-v3  ~1.5 GB  English only  Fast and accurate, but English only
Systran/faster-distil-whisper-small.en  ~150 MB  English only  Very fast; English only

Local LLM Formatting — ollama

ollama is the simplest way to run local LLMs on Linux. It installs as a system service, exposes an OpenAI-compatible /v1/chat/completions endpoint, and manages model downloads automatically.

Setup

# Install ollama (runs as a system service automatically after install)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (see recommendations below)
ollama pull llama3.1:8b

Wisper settings (Formatting tab)

Provider: Custom
API URL: http://localhost:11434/v1/chat/completions (this is the default)
Model name: the exact name as pulled, e.g. llama3.1:8b or gemma3:4b
Start Command (optional): ollama serve

The Start Command is only needed if ollama isn’t running as a service. Most installs don’t need it; the model will load upon first request.
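For reference, the formatting request sent to this endpoint has the standard chat-completions shape. A sketch of an equivalent manual request; the system prompt here is illustrative, not Wisper's actual prompt:

```shell
# Build a chat-completions payload by hand; the model name matches the
# one pulled above, and the prompt wording is only an example.
payload='{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "Fix punctuation and remove filler words from this dictated text."},
    {"role": "user", "content": "um so this is uh a quick test"}
  ]
}'
echo "$payload"

# Send it once ollama is running:
# curl -s http://localhost:11434/v1/chat/completions \
#      -H "Content-Type: application/json" -d "$payload"
```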

Model recommendations

Use a model with at least 8B parameters for reliable punctuation correction and filler-word removal. Smaller models can produce erratic results.

Model        Size   Notes
gemma3:4b    ~3 GB  A rare model under 8B that works well; good if 8B is too big or slow
llama3.1:8b  ~5 GB  Good quality; recommended minimum

If neither model runs well locally, consider using a cloud provider (Groq’s free tier works well) or disabling LLM formatting altogether; the raw Whisper output is still quite good. If your hardware can handle larger models, they tend to work even better.


Auto-start and warm-up

Start Command

The optional Start Command field (in Settings, under Custom for either provider) lets Wisper launch the server automatically. On each recording, Wisper health-checks the endpoint. If it doesn’t respond and a Start Command is set, Wisper runs it and waits up to 15 seconds for the server to come up before proceeding.
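The polling logic can be sketched as a small shell helper (a hypothetical re-implementation, not Wisper's actual code; the health-check URL and start command are placeholders):

```shell
# Poll a health-check command until it succeeds or a timeout (in seconds)
# expires. Returns 0 if the check ever succeeded, 1 otherwise.
wait_for() {
  _check=$1
  _timeout=$2
  _i=0
  while [ "$_i" -lt "$_timeout" ]; do
    if eval "$_check"; then return 0; fi
    sleep 1
    _i=$((_i + 1))
  done
  return 1
}

# Example: if the server does not answer, run the start command, then wait
# up to 15 seconds for it to come up (both values are placeholders):
# eval "$check" || { $start_command; wait_for "$check" 15; }
```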

Warm-up

When dictation begins shortly after a server starts (or after the server hasn’t been used for ~5 minutes), Wisper sends a silent warm-up request to pre-load the model into GPU memory. This reduces or even eliminates the long first-request latency you’d otherwise see when the model is loaded on demand.
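You can trigger the same effect by hand, e.g. to pre-load an ollama model before dictating. A sketch, similar in spirit to Wisper's warm-up but not its actual request; the endpoint and model name assume the ollama setup above:

```shell
# Send a minimal completion request and discard the reply; this makes the
# server load the model into (GPU) memory so the next real request is fast.
result=$(curl -sf --max-time 120 http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 1}') \
  || result="warm-up failed - is ollama running and the model pulled?"
echo "$result"
```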


Troubleshooting

Check /tmp/wisper.log for detailed error messages.

Server not starting

Run the Start Command manually in a terminal to see its output, and confirm it works on its own (Docker or ollama installed, image pulled, port free).

Wrong model name

The Model name setting must exactly match the name as downloaded (e.g. Systran/faster-whisper-large-v3) or as pulled (e.g. llama3.1:8b).

Slow first transcription

The model is being loaded on demand; subsequent requests should be much faster. See Warm-up above.

Health check or warm-up failing

Confirm the API URL setting is correct and that the server actually responds to requests at that URL, e.g. with curl.