Lesson 23 — OpenClaw Whisper Skill: Local Offline Speech-to-Text, Zero API Cost (2026)
Goal: Install the Whisper Skill so OpenClaw can transcribe audio files locally and offline, supporting multiple languages, without relying on any cloud API.
Local Whisper vs. Cloud API Comparison
Before diving in, understand the differences between the two approaches:
| Dimension | Whisper Local (this lesson) | Cloud API (e.g., OpenAI Whisper API) |
|---|---|---|
| Cost | Free, unlimited usage | ~$0.006/minute, ~$0.36/hour of audio |
| Privacy | Audio never leaves your machine | Audio uploaded to cloud servers |
| Network | No internet needed | Must be online |
| Speed | Depends on local hardware | Typically 10–30 seconds per hour of audio |
| English accuracy | Whisper medium/large: >90% | Comparable to local large model |
For meeting recordings containing business secrets or personal privacy, the local approach is the only secure option.
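The cost gap in the table is easy to sanity-check with quick arithmetic. A minimal sketch, assuming the ~$0.006/minute cloud rate cited above:

```python
# Rough cloud-API cost estimate, using the ~$0.006/minute rate from the table.
RATE_PER_MINUTE = 0.006

def cloud_cost(hours_of_audio: float) -> float:
    """Return the estimated API cost in USD for transcribing the given hours."""
    return hours_of_audio * 60 * RATE_PER_MINUTE

# Ten hours of weekly meetings, every week for a year:
yearly = cloud_cost(10 * 52)
print(f"${yearly:.2f}/year")  # prints "$187.20/year" — vs. $0 with the local model
```

Even at modest volumes, local transcription pays for the one-time model download almost immediately.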
Step 1: Install the Whisper CLI
The Whisper Skill depends on the openai-whisper Python command-line tool. Install it first:
pip install openai-whisper
Verify successful installation:
whisper --help
# Should output usage information
Choose a model size (automatically downloaded on first run):
| Model | Size | Best For |
|---|---|---|
| tiny | 75 MB | Quick testing, limited accuracy |
| base | 145 MB | Everyday English use |
| small | 483 MB | English and multilingual use; a good starting point |
| medium | 1.5 GB | Meeting recordings, recommended |
| large | 3 GB | Highest accuracy, top choice for demanding use cases |
If you need high accuracy, go straight to the large model — after the first download it runs entirely offline.
Step 2: Install the Skill
In WebChat or Telegram, send:
/install @steipete/openai-whisper
Verify installation:
pnpm openclaw skills list
# openai-whisper should appear in the list
Step 3: Basic Transcription
After installation, send an audio file path to AI:
Transcribe this recording for me: ~/Downloads/meeting.mp3
Or use the slash command:
/whisper ~/Downloads/meeting.mp3
The Skill calls the local Whisper CLI and outputs a plain-text transcript. Supported audio formats: mp3, wav, m4a, flac, ogg, webm.
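A quick extension check catches unsupported files before the CLI is ever invoked. A minimal sketch (the extension list mirrors the formats above; the function name is illustrative, not part of the Skill's API):

```python
from pathlib import Path

# Audio formats the Skill accepts, per the list above.
SUPPORTED = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}

def is_transcribable(path: str) -> bool:
    """Check the file extension against the formats Whisper handles."""
    return Path(path).suffix.lower() in SUPPORTED

print(is_transcribable("~/Downloads/meeting.mp3"))  # True
print(is_transcribable("~/Downloads/slides.pdf"))   # False
```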
Step 4: Specify Language
Without specifying a language, Whisper auto-detects, but this can be unstable for mixed-language recordings. It's better to be explicit:
Transcribe this meeting recording in English: ~/Recordings/Q1-summary.m4a
Language: English
The underlying command is equivalent to:
whisper ~/Recordings/Q1-summary.m4a --language en --model medium
For recordings with mixed languages (like a technical meeting), specify --language en (or your primary language) and Whisper will prioritize that language while preserving technical terms from other languages.
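If you script this yourself, the same invocation can be assembled with `subprocess`. A sketch, assuming the `whisper` CLI is on your PATH (the helper name is illustrative; the actual run is left commented out):

```python
import subprocess

def build_whisper_cmd(audio_path: str, language: str = "en",
                      model: str = "medium") -> list[str]:
    """Assemble the whisper CLI argv list shown above."""
    return ["whisper", audio_path, "--language", language, "--model", model]

cmd = build_whisper_cmd("~/Recordings/Q1-summary.m4a")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually invoke the CLI
```

Passing the command as a list (rather than a shell string) avoids quoting issues with paths that contain spaces.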
Step 5: Output Format Control
The Whisper Skill supports three output formats:
Plain text (default):
/whisper ~/recording.mp3 --format txt
SRT subtitle file with timestamps:
/whisper ~/recording.mp3 --format srt
Sample output:
1
00:00:01,200 --> 00:00:04,800
Good morning everyone, let's discuss the Q1 sales figures today.
2
00:00:05,100 --> 00:00:09,300
According to the latest report, overall growth was up 23%.
JSON format (includes word-level timestamps, good for further processing):
/whisper ~/recording.mp3 --format json
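SRT timestamps use the `HH:MM:SS,mmm` format shown in the sample above. If you post-process subtitles (shifting cues, merging segments), a small parser is handy. A minimal sketch:

```python
def srt_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '00:00:04,800' to seconds."""
    hms, millis = ts.split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s + int(millis) / 1000

print(srt_to_seconds("00:00:04,800"))  # ≈ 4.8
print(srt_to_seconds("01:02:03,250"))  # 3723.25
```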
Step 6: Transcribe Then Auto-Summarize
A complete workflow — transcribe a meeting and generate minutes in one instruction:
Transcribe this meeting recording, then generate meeting minutes (including decisions, action items, and owners):
~/Recordings/weekly-sync.mp3
OpenClaw will first call the Whisper Skill to transcribe, then call the Summarize Skill (Lesson 10) to generate structured minutes.
Step 7: Meeting Recording Practical Workflow
Complete flow: Record → Transcribe → Minutes → Save
1. After recording ends, send the file path to OpenClaw
2. Instruction: Convert this meeting recording to meeting minutes and save to Obsidian's /Meeting Notes/ directory:
~/Recordings/2026-03-30-product-meeting.mp3
OpenClaw executes:
- Whisper Skill transcribes locally, generates a text draft
- Summarize Skill distills the meeting minutes
- Saves to the specified directory via Obsidian MCP
To archive in Notion instead, replace the last step with the Notion MCP or API Gateway Skill (Lesson 25).
FAQ
Can OpenClaw Whisper recognize non-English languages?
Yes. The Whisper medium and large models achieve over 90% accuracy for many languages, and Whisper is one of the best open-source speech recognition models available. For best results, use --model medium or --model large, and specify the language explicitly with --language <lang_code> so mixed-language recordings aren't misidentified. Accuracy may be lower for regional dialects and heavy accents.
Does Whisper local transcription need a GPU?
No — CPU works too, just slower. On Apple Silicon (M1/M2/M3/M4), Whisper automatically uses Metal GPU acceleration, with performance close to a discrete GPU. On Intel CPUs, the small or medium model is recommended; processing 1 hour of audio with the large model on CPU takes about 15–30 minutes. Note that PyTorch (torch) is installed automatically as a dependency of openai-whisper, so no separate install is needed.
What's the transcription speed? How long for a 1-hour recording?
Speed varies by hardware and model: Apple M2 Pro with medium model: ~3–5 minutes for a 1-hour recording; Apple M2 Pro with large model: ~8–12 minutes; standard Intel i7 CPU with medium model: ~20–40 minutes. With an NVIDIA GPU, the large model can transcribe a 1-hour recording in 2–3 minutes.
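These figures amount to a real-time factor: how many minutes of audio are processed per minute of wall-clock time. A sketch of the conversion (the example factor of 15x is interpolated from the M2 Pro + medium figures above, not a benchmark):

```python
def transcription_minutes(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock minutes = audio length divided by the speed-up over real time."""
    return audio_hours * 60 / realtime_factor

# Assuming roughly 15x real time (consistent with 3-5 min for 1 hour of audio):
print(transcription_minutes(1, 15))  # → 4.0 minutes for a 1-hour recording
```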
Where are the SRT subtitle files saved?
With the --format srt parameter, Whisper saves the SRT file in the same directory as the audio file by default, using the same filename with the extension changed to .srt. For example, ~/Downloads/meeting.mp3 produces ~/Downloads/meeting.srt. You can also ask AI in your instruction to copy the SRT file to a specific location.
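The default output path can be derived programmatically, which is useful when chaining transcription into a larger script. A sketch using `pathlib` (the helper name is illustrative):

```python
from pathlib import Path

def default_srt_path(audio_path: str) -> str:
    """Whisper writes the SRT next to the audio file, swapping the extension."""
    return str(Path(audio_path).with_suffix(".srt"))

print(default_srt_path("~/Downloads/meeting.mp3"))  # ~/Downloads/meeting.srt
```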