
Lesson 23 — OpenClaw Whisper Skill: Local Offline Speech-to-Text, Zero API Cost (2026)

Goal: Install the Whisper Skill so OpenClaw can transcribe audio files locally and offline, supporting multiple languages, without relying on any cloud API.


Local Whisper vs. Cloud API Comparison

Before diving in, understand the differences between the two approaches:

Dimension | Local Whisper (this lesson) | Cloud API (e.g., OpenAI Whisper API)
Cost | Free, unlimited usage | ~$0.006/minute (~$0.36 per hour of audio)
Privacy | Audio never leaves your machine | Audio uploaded to cloud servers
Network | No internet needed | Must be online
Speed | Depends on local hardware | Typically 10–30 seconds per hour of audio
English accuracy | medium/large models: >90% | Comparable to the local large model

For meeting recordings containing business secrets or personal privacy, the local approach is the only secure option.
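As a quick sanity check on the table above, a minimal sketch of the cost math (the $0.006/minute rate is the figure quoted in the table; the function name is illustrative):

```python
# Estimate cloud transcription cost at the rate quoted in the table above.
CLOUD_RATE_PER_MINUTE = 0.006  # USD per audio minute (assumption from the table)

def cloud_cost(audio_hours: float) -> float:
    """Return the estimated API cost in USD for a given amount of audio."""
    return audio_hours * 60 * CLOUD_RATE_PER_MINUTE

# One hour of audio: 60 min * $0.006 = $0.36
print(f"{cloud_cost(1):.2f}")   # 0.36
# A year of weekly one-hour meetings: 52 hours
print(f"{cloud_cost(52):.2f}")  # 18.72
```

Local transcription costs nothing per run, so the more audio you process, the bigger the gap.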


Step 1: Install the Whisper CLI

The Whisper Skill depends on the openai-whisper Python command-line tool. Install it first:

pip install openai-whisper

Verify successful installation:

whisper --help
# Should output usage information
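If you prefer to verify from a script, a small sketch using only the standard library (it checks that the whisper executable is on PATH, not that any model has been downloaded; the function name is illustrative):

```python
import shutil

def whisper_available() -> bool:
    """Return True if the `whisper` CLI is somewhere on PATH."""
    return shutil.which("whisper") is not None

if whisper_available():
    print("whisper CLI found")
else:
    print("whisper CLI missing: run `pip install openai-whisper`")
```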

Choose a model size (automatically downloaded on first run):

Model | Size | Best For
tiny | 75 MB | Quick testing, limited accuracy
base | 145 MB | Everyday English use
small | 483 MB | English and other languages, a good starter
medium | 1.5 GB | Meeting recordings, recommended
large | 3 GB | Highest accuracy, top choice for demanding use cases

If you need high accuracy, go straight to the large model — after the first download it runs entirely offline.


Step 2: Install the Skill

In WebChat or Telegram, send:

/install @steipete/openai-whisper

Verify installation:

pnpm openclaw skills list
# openai-whisper should appear in the list

Step 3: Basic Transcription

After installation, send an audio file path to the AI:

Transcribe this recording for me: ~/Downloads/meeting.mp3

Or use the slash command:

/whisper ~/Downloads/meeting.mp3

The Skill calls the local Whisper CLI and outputs a plain-text transcript. Supported audio formats: mp3, wav, m4a, flac, ogg, webm.
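Before handing a file to the Skill, you can check its extension against that supported list; a minimal sketch (the helper name is illustrative, not part of the Skill):

```python
from pathlib import Path

# Formats listed above as supported by the Skill
SUPPORTED = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}

def is_supported_audio(path: str) -> bool:
    """True if the file extension is one the Whisper Skill accepts."""
    return Path(path).suffix.lower() in SUPPORTED

print(is_supported_audio("~/Downloads/meeting.mp3"))  # True
print(is_supported_audio("~/Downloads/slides.pdf"))   # False
```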


Step 4: Specify Language

Without specifying a language, Whisper auto-detects, but this can be unstable for mixed-language recordings. It's better to be explicit:

Transcribe this meeting recording in English: ~/Recordings/Q1-summary.m4a
Language: English

The underlying command is equivalent to:

whisper ~/Recordings/Q1-summary.m4a --language en --model medium

For recordings with mixed languages (like a technical meeting), specify --language en (or your primary language) and Whisper will prioritize that language while preserving technical terms from other languages.
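To see how an instruction maps onto the CLI invocation, here is a sketch that assembles the equivalent argument list (the helper name and its defaults are illustrative, not part of the Skill):

```python
from pathlib import Path

def build_whisper_cmd(audio: str, language: str = "en",
                      model: str = "medium") -> list[str]:
    """Assemble the argv list for a local whisper run."""
    return [
        "whisper", str(Path(audio).expanduser()),  # resolve ~ to the home dir
        "--language", language,
        "--model", model,
    ]

cmd = build_whisper_cmd("~/Recordings/Q1-summary.m4a")
print(" ".join(cmd))
```

You could pass the resulting list straight to subprocess.run once the whisper CLI is installed.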


Step 5: Output Format Control

The Whisper Skill supports three output formats:

Plain text (default):

/whisper ~/recording.mp3 --format txt

SRT subtitle file with timestamps:

/whisper ~/recording.mp3 --format srt

Sample output:

1
00:00:01,200 --> 00:00:04,800
Good morning everyone, let's discuss the Q1 sales figures today.
 
2
00:00:05,100 --> 00:00:09,300
According to the latest report, overall growth was up 23%.
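If you need those timestamps programmatically, SRT is easy to parse with the standard library; a minimal sketch that handles the common well-formed case (the function name is illustrative):

```python
import re

SRT_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_srt(text: str) -> list[tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text) for each subtitle cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed cues
        m = SRT_TIME.match(lines[1])
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = """1
00:00:01,200 --> 00:00:04,800
Good morning everyone, let's discuss the Q1 sales figures today.

2
00:00:05,100 --> 00:00:09,300
According to the latest report, overall growth was up 23%."""

for start, end, line in parse_srt(sample):
    print(f"[{start:.1f}-{end:.1f}] {line}")
```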

JSON format (segment-level timestamps by default; pass --word_timestamps True for word-level detail, useful for further processing):

/whisper ~/recording.mp3 --format json

Step 6: Transcribe Then Auto-Summarize

A complete workflow — transcribe a meeting and generate minutes in one instruction:

Transcribe this meeting recording, then generate meeting minutes (including decisions, action items, and owners):
~/Recordings/weekly-sync.mp3

OpenClaw will first call the Whisper Skill to transcribe, then call the Summarize Skill (Lesson 10) to generate structured minutes.


Step 7: Meeting Recording Practical Workflow

Complete flow: Record → Transcribe → Minutes → Save

1. After recording ends, send the file path to OpenClaw
2. Instruction: Convert this meeting recording to meeting minutes and save to Obsidian's /Meeting Notes/ directory:
   ~/Recordings/2026-03-30-product-meeting.mp3

OpenClaw executes:

  1. Whisper Skill transcribes locally, generates a text draft
  2. Summarize Skill distills the meeting minutes
  3. Saves to the specified directory via Obsidian MCP

To archive in Notion instead, replace the last step with the Notion MCP or API Gateway Skill (Lesson 25).


FAQ

Can OpenClaw Whisper recognize non-English languages?

Yes. The medium and large models exceed 90% accuracy for many languages, making Whisper one of the best open-source speech recognition models available. For best results, use --model medium or --model large, and specify the language explicitly with --language <lang_code> so mixed-language recordings aren't misidentified. Accuracy may drop for regional dialects and heavy accents.

Does Whisper local transcription need a GPU?

No. CPU works too, just slower. Apple Silicon (M1/M2/M3/M4) is fast enough in practice that the medium model is comfortable for everyday use. On Intel CPUs, the small or medium model is recommended: processing 1 hour of audio with the large model on CPU takes about 15–30 minutes. openai-whisper installs PyTorch as a dependency; with an NVIDIA GPU, install a CUDA-enabled torch build so transcription runs on the GPU.

What's the transcription speed? How long for a 1-hour recording?

Speed varies by hardware and model: Apple M2 Pro with medium model: ~3–5 minutes for a 1-hour recording; Apple M2 Pro with large model: ~8–12 minutes; standard Intel i7 CPU with medium model: ~20–40 minutes. With an NVIDIA GPU, the large model can transcribe a 1-hour recording in 2–3 minutes.
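Those figures boil down to a realtime factor (audio minutes transcribed per minute of wall-clock time); a tiny sketch for estimating your own runs, using factors taken from the midpoints of the ranges quoted above (the hardware keys are illustrative labels):

```python
# Approximate realtime factors derived from the ranges quoted above
# (audio-minutes processed per minute of wall-clock time).
REALTIME_FACTOR = {
    ("m2-pro", "medium"): 15.0,     # ~4 min for a 60-min recording
    ("m2-pro", "large"): 6.0,       # ~10 min
    ("intel-i7", "medium"): 2.0,    # ~30 min
    ("nvidia-gpu", "large"): 24.0,  # ~2.5 min
}

def estimated_minutes(audio_minutes: float, hardware: str, model: str) -> float:
    """Rough wall-clock estimate for one transcription run."""
    return audio_minutes / REALTIME_FACTOR[(hardware, model)]

print(round(estimated_minutes(60, "m2-pro", "medium"), 1))  # 4.0
```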

Where are the SRT subtitle files saved?

With the --format srt parameter, the Skill saves the SRT file in the same directory as the audio file, using the same filename with the extension changed to .srt. For example, ~/Downloads/meeting.mp3 produces ~/Downloads/meeting.srt. (Note that the raw whisper CLI writes to the current working directory unless you pass --output_dir.) You can also ask the AI in your instruction to copy the SRT file to a specific location.
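The filename rule is the same suffix swap pathlib performs; a one-line sketch (the function name is illustrative):

```python
from pathlib import Path

def srt_path(audio: str) -> Path:
    """Subtitle filename derived from an audio filename."""
    return Path(audio).with_suffix(".srt")

print(srt_path("~/Downloads/meeting.mp3"))  # ~/Downloads/meeting.srt
```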


Next Steps

  • Lesson 24 — Learn to write your own Skill and publish it to ClawHub
  • Lesson 10 — Combine with the Summarize Skill to automatically generate summaries from transcripts

Stay up to date with OpenClaw

Follow @lanmiaoai on X for tips, updates and new tutorials.
