
Lesson 23 — OpenClaw Whisper Skill: Local Offline Speech-to-Text, Zero API Cost (2026)

Goal: Install the Whisper Skill so OpenClaw can transcribe audio files locally and offline, supporting multiple languages, without relying on any cloud API.


Local Whisper vs. Cloud API Comparison

Before diving in, understand the differences between the two approaches:

Dimension | Local Whisper (this lesson) | Cloud API (e.g., OpenAI Whisper API)
Cost | Free, unlimited usage | ~$0.006/minute (~$0.36 per hour of audio)
Privacy | Audio never leaves your machine | Audio uploaded to cloud servers
Network | No internet needed | Must be online
Speed | Depends on local hardware | Typically 10–30 seconds per hour of audio
English accuracy | medium/large models: >90% | Comparable to the local large model

For meeting recordings containing business secrets or personal privacy, the local approach is the only secure option.
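As a quick sanity check on the table above, a minimal sketch of the cost math (the $0.006/minute rate is the figure quoted in the table; the function name is illustrative):

```python
# Estimate cloud transcription cost at the rate quoted in the table above.
CLOUD_RATE_PER_MINUTE = 0.006  # USD per audio minute (assumption from the table)

def cloud_cost(audio_hours: float) -> float:
    """Return the estimated API cost in USD for a given amount of audio."""
    return audio_hours * 60 * CLOUD_RATE_PER_MINUTE

# One hour of audio: 60 min * $0.006 = $0.36
print(f"{cloud_cost(1):.2f}")   # 0.36
# A year of weekly one-hour meetings: 52 hours
print(f"{cloud_cost(52):.2f}")  # 18.72
```

Local transcription costs nothing per run, so the more audio you process, the bigger the gap.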


Step 1: Install the Whisper CLI

The Whisper Skill depends on the openai-whisper Python command-line tool. Install it first:

pip install openai-whisper

Verify successful installation:

whisper --help
# Should output usage information
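If you prefer to verify from a script, a small sketch using only the standard library (it checks that the whisper executable is on PATH, not that any model has been downloaded; the function name is illustrative):

```python
import shutil

def whisper_available() -> bool:
    """Return True if the `whisper` CLI is somewhere on PATH."""
    return shutil.which("whisper") is not None

if whisper_available():
    print("whisper CLI found")
else:
    print("whisper CLI missing: run `pip install openai-whisper`")
```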

Choose a model size (automatically downloaded on first run):

Model | Size | Best For
tiny | 75 MB | Quick testing, limited accuracy
base | 145 MB | Everyday English use
small | 483 MB | English and other languages, a good starter
medium | 1.5 GB | Meeting recordings, recommended
large | 3 GB | Highest accuracy, top choice for demanding use cases

If you need high accuracy, go straight to the large model — after the first download it runs entirely offline.


Step 2: Install the Skill

In WebChat or Telegram, send:

/install @steipete/openai-whisper

Verify installation:

pnpm openclaw skills list
# openai-whisper should appear in the list

Step 3: Basic Transcription

After installation, send an audio file path to the AI:

Transcribe this recording for me: ~/Downloads/meeting.mp3

Or use the slash command:

/whisper ~/Downloads/meeting.mp3

The Skill calls the local Whisper CLI and outputs a plain-text transcript. Supported audio formats: mp3, wav, m4a, flac, ogg, webm.
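Before handing a file to the Skill, you can check its extension against that supported list; a minimal sketch (the helper name is illustrative, not part of the Skill):

```python
from pathlib import Path

# Formats listed above as supported by the Skill
SUPPORTED = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}

def is_supported_audio(path: str) -> bool:
    """True if the file extension is one the Whisper Skill accepts."""
    return Path(path).suffix.lower() in SUPPORTED

print(is_supported_audio("~/Downloads/meeting.mp3"))  # True
print(is_supported_audio("~/Downloads/slides.pdf"))   # False
```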


Step 4: Specify Language

Without specifying a language, Whisper auto-detects, but this can be unstable for mixed-language recordings. It's better to be explicit:

Transcribe this meeting recording in English: ~/Recordings/Q1-summary.m4a
Language: English

The underlying command is equivalent to:

whisper ~/Recordings/Q1-summary.m4a --language en --model medium

For recordings with mixed languages (like a technical meeting), specify --language en (or your primary language) and Whisper will prioritize that language while preserving technical terms from other languages.
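To see how an instruction maps onto the CLI invocation, here is a sketch that assembles the equivalent argument list (the helper name and its defaults are illustrative, not part of the Skill):

```python
from pathlib import Path

def build_whisper_cmd(audio: str, language: str = "en",
                      model: str = "medium") -> list[str]:
    """Assemble the argv list for a local whisper run."""
    return [
        "whisper", str(Path(audio).expanduser()),  # resolve ~ to the home dir
        "--language", language,
        "--model", model,
    ]

cmd = build_whisper_cmd("~/Recordings/Q1-summary.m4a")
print(" ".join(cmd))
```

You could pass the resulting list straight to subprocess.run once the whisper CLI is installed.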


Step 5: Output Format Control

The Whisper Skill supports three output formats:

Plain text (default):

/whisper ~/recording.mp3 --format txt

SRT subtitle file with timestamps:

/whisper ~/recording.mp3 --format srt

Sample output:

1
00:00:01,200 --> 00:00:04,800
Good morning everyone, let's discuss the Q1 sales figures today.
 
2
00:00:05,100 --> 00:00:09,300
According to the latest report, overall growth was up 23%.
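If you need those timestamps programmatically, SRT is easy to parse with the standard library; a minimal sketch that handles the common well-formed case (the function name is illustrative):

```python
import re

SRT_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_srt(text: str) -> list[tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text) for each subtitle cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed cues
        m = SRT_TIME.match(lines[1])
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = """1
00:00:01,200 --> 00:00:04,800
Good morning everyone, let's discuss the Q1 sales figures today.

2
00:00:05,100 --> 00:00:09,300
According to the latest report, overall growth was up 23%."""

for start, end, line in parse_srt(sample):
    print(f"[{start:.1f}-{end:.1f}] {line}")
```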

JSON format (segment-level timestamps by default; pass --word_timestamps True for word-level detail, useful for further processing):

/whisper ~/recording.mp3 --format json

Step 6: Transcribe Then Auto-Summarize

A complete workflow — transcribe a meeting and generate minutes in one instruction:

Transcribe this meeting recording, then generate meeting minutes (including decisions, action items, and owners):
~/Recordings/weekly-sync.mp3

OpenClaw will first call the Whisper Skill to transcribe, then call the Summarize Skill (Lesson 10) to generate structured minutes.


Step 7: Meeting Recording Practical Workflow

Complete flow: Record → Transcribe → Minutes → Save

1. After recording ends, send the file path to OpenClaw
2. Instruction: Convert this meeting recording to meeting minutes and save to Obsidian's /Meeting Notes/ directory:
   ~/Recordings/2026-03-30-product-meeting.mp3

OpenClaw executes:

  1. Whisper Skill transcribes locally, generates a text draft
  2. Summarize Skill distills the meeting minutes
  3. Saves to the specified directory via Obsidian MCP

To archive in Notion instead, replace the last step with the Notion MCP or API Gateway Skill (Lesson 25).


FAQ

Can OpenClaw Whisper recognize non-English languages?

Yes. The medium and large models exceed 90% accuracy for many languages, making Whisper one of the best open-source speech recognition models available. For best results, use --model medium or --model large, and specify the language explicitly with --language <lang_code> so mixed-language recordings aren't misidentified. Accuracy may drop for regional dialects and heavy accents.

Does Whisper local transcription need a GPU?

No. CPU works too, just slower. Apple Silicon (M1/M2/M3/M4) is fast enough in practice that the medium model is comfortable for everyday use. On Intel CPUs, the small or medium model is recommended: processing 1 hour of audio with the large model on CPU takes about 15–30 minutes. openai-whisper installs PyTorch as a dependency; with an NVIDIA GPU, install a CUDA-enabled torch build so transcription runs on the GPU.

What's the transcription speed? How long for a 1-hour recording?

Speed varies by hardware and model: Apple M2 Pro with medium model: ~3–5 minutes for a 1-hour recording; Apple M2 Pro with large model: ~8–12 minutes; standard Intel i7 CPU with medium model: ~20–40 minutes. With an NVIDIA GPU, the large model can transcribe a 1-hour recording in 2–3 minutes.
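Those figures boil down to a realtime factor (audio minutes transcribed per minute of wall-clock time); a tiny sketch for estimating your own runs, using factors taken from the midpoints of the ranges quoted above (the hardware keys are illustrative labels):

```python
# Approximate realtime factors derived from the ranges quoted above
# (audio-minutes processed per minute of wall-clock time).
REALTIME_FACTOR = {
    ("m2-pro", "medium"): 15.0,     # ~4 min for a 60-min recording
    ("m2-pro", "large"): 6.0,       # ~10 min
    ("intel-i7", "medium"): 2.0,    # ~30 min
    ("nvidia-gpu", "large"): 24.0,  # ~2.5 min
}

def estimated_minutes(audio_minutes: float, hardware: str, model: str) -> float:
    """Rough wall-clock estimate for one transcription run."""
    return audio_minutes / REALTIME_FACTOR[(hardware, model)]

print(round(estimated_minutes(60, "m2-pro", "medium"), 1))  # 4.0
```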

Where are the SRT subtitle files saved?

With the --format srt parameter, the Skill saves the SRT file in the same directory as the audio file, using the same filename with the extension changed to .srt. For example, ~/Downloads/meeting.mp3 produces ~/Downloads/meeting.srt. (Note that the raw whisper CLI writes to the current working directory unless you pass --output_dir.) You can also ask the AI in your instruction to copy the SRT file to a specific location.
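The filename rule is the same suffix swap pathlib performs; a one-line sketch (the function name is illustrative):

```python
from pathlib import Path

def srt_path(audio: str) -> Path:
    """Subtitle filename derived from an audio filename."""
    return Path(audio).with_suffix(".srt")

print(srt_path("~/Downloads/meeting.mp3"))  # ~/Downloads/meeting.srt
```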


Next Steps

  • Lesson 24 — Learn to write your own Skill and publish it to ClawHub
  • Lesson 10 — Combine with the Summarize Skill to automatically generate summaries from transcripts

Stay up to date with OpenClaw

Follow @lanmiaoai on X for tips, updates and new tutorials.
