Back to Skills
    🦞

    faster-whisper

    Local speech-to-text using faster-whisper.

    By @theplasmak
    View on GitHub
    SKILL.md
    ---
    name: faster-whisper
    description: Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.
    version: 1.0.4
    author: ThePlasmak
    homepage: https://github.com/ThePlasmak/faster-whisper
    tags: ["audio", "transcription", "whisper", "speech-to-text", "ml", "cuda", "gpu"]
    platforms: ["windows", "linux", "macos", "wsl2"]
    metadata: {"moltbot":{"emoji":"🗣️","requires":{"bins":["ffmpeg","python3"]}}}
    ---
    
    # Faster Whisper
    
    Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs **4-6x faster** with identical accuracy. With GPU acceleration, expect **~20x realtime** transcription (a 10-minute audio file in ~30 seconds).
    
    ## When to Use
    
    Use this skill when you need to:
    - **Transcribe audio/video files** — meetings, interviews, podcasts, lectures, YouTube videos
    - **Convert speech to text locally** — no API costs, works offline (after model download)
    - **Batch process multiple audio files** — efficient for large collections
    - **Generate subtitles/captions** — word-level timestamps available
    - **Multilingual transcription** — supports 99+ languages with auto-detection
    
    **Trigger phrases:** "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video"
    
    **When NOT to use:**
    - Real-time/streaming transcription (use streaming-optimized tools instead)
    - Cloud-only environments without local compute
    - Files <10 seconds where API call latency doesn't matter
    
    ## Quick Reference
    
    | Task | Command | Notes |
    |------|---------|-------|
    | **Basic transcription** | `./scripts/transcribe audio.mp3` | Uses default distil-large-v3 |
    | **Faster English** | `./scripts/transcribe audio.mp3 --model distil-medium.en --language en` | English-only, 6.8x faster |
    | **Maximum accuracy** | `./scripts/transcribe audio.mp3 --model large-v3-turbo --beam-size 10` | Slower but best quality |
    | **Word timestamps** | `./scripts/transcribe audio.mp3 --word-timestamps` | For subtitles/captions |
    | **JSON output** | `./scripts/transcribe audio.mp3 --json -o output.json` | Programmatic access |
    | **Multilingual** | `./scripts/transcribe audio.mp3 --model large-v3-turbo` | Auto-detects language |
    | **Remove silence** | `./scripts/transcribe audio.mp3 --vad` | Voice activity detection |
    
    ## Model Selection
    
    Choose the right model for your needs:
    
    ```dot
    digraph model_selection {
        rankdir=LR;
        node [shape=box, style=rounded];
    
        start [label="Start", shape=doublecircle];
        need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
        multilingual [label="Multilingual\ncontent?", shape=diamond];
        resource_constrained [label="Resource\nconstraints?", shape=diamond];
    
        large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
        large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
        distil_large [label="distil-large-v3\n(default)", style="rounded,filled", fillcolor=lightgreen];
        distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
        distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];
    
        start -> need_accuracy;
        need_accuracy -> large_v3 [label="yes"];
        need_accuracy -> multilingual [label="no"];
        multilingual -> large_turbo [label="yes"];
        multilingual -> resource_constrained [label="no (English)"];
        resource_constrained -> distil_small [label="mobile/edge"];
        resource_constrained -> distil_medium [label="some limits"];
        resource_constrained -> distil_large [label="no"];
    }
    ```
    
    ### Model Table
    
    #### Standard Models (Full Whisper)
    | Model | Size | Speed | Accuracy | Use Case |
    |-------|------|-------|----------|----------|
    | `tiny` / `tiny.en` | 39M | Fastest | Basic | Quick drafts |
    | `base` / `base.en` | 74M | Very fast | Good | General use |
    | `small` / `small.en` | 244M | Fast | Better | Most tasks |
    | `medium` / `medium.en` | 769M | Moderate | High | Quality transcription |
    | `large-v1/v2/v3` | 1.5GB | Slower | Best | Maximum accuracy |
    | **`large-v3-turbo`** | 809M | Fast | Excellent | **Recommended for accuracy** |
    
    #### Distilled Models (~6x Faster, ~1% WER difference)
    | Model | Size | Speed vs Standard | Accuracy | Use Case |
    |-------|------|-------------------|----------|----------|
    | **`distil-large-v3`** | 756M | ~6.3x faster | 9.7% WER | **Default, best balance** |
    | `distil-large-v2` | 756M | ~5.8x faster | 10.1% WER | Fallback |
    | `distil-medium.en` | 394M | ~6.8x faster | 11.1% WER | English-only, resource-constrained |
    | `distil-small.en` | 166M | ~5.6x faster | 12.1% WER | Mobile/edge devices |
    
    `.en` models are English-only and slightly faster/better for English content.
    
    ## Setup
    
    ### Linux / macOS / WSL2
    ```bash
    # Run the setup script (creates venv, installs deps, auto-detects GPU)
    ./setup.sh
    ```
    
    ### Windows (Native)
    ```powershell
    # Run from PowerShell (auto-installs Python & ffmpeg if missing via winget)
    .\setup.ps1
    ```
    
    The Windows setup script will:
    - Auto-install Python 3.12 via winget if not found
    - Auto-install ffmpeg via winget if not found
    - Detect NVIDIA GPU and install CUDA-enabled PyTorch
    - Create venv and install all dependencies
    
    Requirements:
    - **Linux/macOS/WSL2**: Python 3.10+, ffmpeg
    - **Windows**: Nothing! Setup auto-installs prerequisites via winget
    
    ### Platform Support
    
    | Platform | Acceleration | Speed | Auto-Install |
    |----------|-------------|-------|--------------|
    | **Windows + NVIDIA GPU** | CUDA | ~20x realtime 🚀 | ✅ Full |
    | **Linux + NVIDIA GPU** | CUDA | ~20x realtime 🚀 | Manual prereqs |
    | **WSL2 + NVIDIA GPU** | CUDA | ~20x realtime 🚀 | Manual prereqs |
    | macOS Apple Silicon | CPU* | ~3-5x realtime | Manual prereqs |
    | macOS Intel | CPU | ~1-2x realtime | Manual prereqs |
    | Windows (no GPU) | CPU | ~1x realtime | ✅ Full |
    | Linux (no GPU) | CPU | ~1x realtime | Manual prereqs |
    
    *faster-whisper uses CTranslate2 which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.
    
    ### GPU Support (IMPORTANT!)
    
    The setup script auto-detects your GPU and installs PyTorch with CUDA. **Always use GPU if available** — CPU transcription is extremely slow.
    
    | Hardware | Speed | 9-min video |
    |----------|-------|-------------|
    | RTX 3070 (GPU) | ~20x realtime | ~27 sec |
    | CPU (int8) | ~0.3x realtime | ~30 min |
    
    If setup didn't detect your GPU, manually install PyTorch with CUDA:
    
    **Linux/macOS/WSL2:**
    ```bash
    # For CUDA 12.x
    uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121
    
    # For CUDA 11.x
    uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118
    ```
    
    **Windows:**
    ```powershell
    # For CUDA 12.x
    .venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
    
    # For CUDA 11.x
    .venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu118
    ```
    
    - **Windows users**: Ensure you have [NVIDIA drivers](https://www.nvidia.com/download/index.aspx) installed
    - **WSL2 users**: Ensure you have the [NVIDIA CUDA drivers for WSL](https://docs.nvidia.com/cuda/wsl-user-guide/) installed on Windows
    
    ## Usage
    
    **Linux/macOS/WSL2:**
    ```bash
    # Basic transcription
    ./scripts/transcribe audio.mp3
    
    # With specific model
    ./scripts/transcribe audio.wav --model large-v3-turbo
    
    # With word timestamps
    ./scripts/transcribe audio.mp3 --word-timestamps
    
    # Specify language (faster than auto-detect)
    ./scripts/transcribe audio.mp3 --language en
    
    # JSON output
    ./scripts/transcribe audio.mp3 --json
    ```
    
    **Windows (cmd or PowerShell):**
    ```powershell
    # Basic transcription
    .\scripts\transcribe.cmd audio.mp3
    
    # With specific model
    .\scripts\transcribe.cmd audio.wav --model large-v3-turbo
    
    # With word timestamps (PowerShell native syntax also works)
    .\scripts\transcribe.ps1 audio.mp3 -WordTimestamps
    
    # JSON output
    .\scripts\transcribe.cmd audio.mp3 --json
    ```
    
    ## Options
    
    ```
    --model, -m        Model name (default: distil-large-v3)
    --language, -l     Language code (e.g., en, es, fr - auto-detect if omitted)
    --word-timestamps  Include word-level timestamps
    --beam-size        Beam search size (default: 5, higher = more accurate but slower)
    --vad              Enable voice activity detection (removes silence)
    --json, -j         Output as JSON
    --output, -o       Save transcript to file
    --device           cpu or cuda (auto-detected)
    --compute-type     int8, float16, float32 (default: auto-optimized)
    --quiet, -q        Suppress progress messages
    ```
    
    ## Examples
    
    ```bash
    # Transcribe YouTube audio (after extraction with yt-dlp)
    yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
    ./scripts/transcribe audio.mp3
    
    # Batch transcription with JSON output
    for file in *.mp3; do
      ./scripts/transcribe "$file" --json > "${file%.mp3}.json"
    done
    
    # High-accuracy transcription with larger beam size
    ./scripts/transcribe audio.mp3 \
      --model large-v3-turbo --beam-size 10 --word-timestamps
    
    # Fast English-only transcription
    ./scripts/transcribe audio.mp3 \
      --model distil-medium.en --language en
    
    # Transcribe with VAD (removes silence)
    ./scripts/transcribe audio.mp3 --vad
    ```
    
    ## Common Mistakes
    
    | Mistake | Problem | Solution |
    |---------|---------|----------|
    | **Using CPU when GPU available** | 10-20x slower transcription | Check `nvidia-smi` on Windows/Linux; verify CUDA installation |
    | **Not specifying language** | Wastes time auto-detecting on known content | Use `--language en` when you know the language |
    | **Using wrong model** | Unnecessary slowness or poor accuracy | Default `distil-large-v3` is excellent; only use `large-v3` if accuracy issues |
    | **Ignoring distilled models** | Missing 6x speedup with <1% accuracy loss | Try `distil-large-v3` before reaching for standard models |
    | **Forgetting ffmpeg** | Setup fails or audio can't be processed | Setup script handles th
    
    ... (truncated)