Back to Skills
    🦞

    local-stt

    Local STT with selectable backends - Parakeet (best accuracy) or Whisper.

    By @araa47
    View on GitHub
    SKILL.md
    ---
    name: local-stt
    description: Local STT with selectable backends - Parakeet (best accuracy) or Whisper (fastest, multilingual).
    metadata: {"openclaw":{"emoji":"🎙️","requires":{"bins":["ffmpeg"]}}}
    ---
    
    # Local STT (Parakeet / Whisper)
    
    Unified local speech-to-text using ONNX Runtime with int8 quantization. Choose your backend:
    
    - **Parakeet** (default): Best accuracy for English, correctly captures names and filler words
    - **Whisper**: Fastest inference, supports 99 languages
    
    ## Usage
    
    ```bash
    # Default: Parakeet v2 (best English accuracy)
    ~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg
    
    # Explicit backend selection
    ~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg -b whisper
    ~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg -b parakeet -m v3
    
    # Quiet mode (suppress progress)
    ~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg --quiet
    ```
    
    ## Options
    
    - `-b/--backend`: `parakeet` (default), `whisper`
    - `-m/--model`: Model variant (see below)
    - `--no-int8`: Disable int8 quantization
    - `-q/--quiet`: Suppress progress
    - `--room-id`: Matrix room ID for direct message
    
    ## Models
    
    ### Parakeet (default backend)
    | Model | Description |
    |-------|-------------|
    | **v2** (default) | English only, best accuracy |
    | v3 | Multilingual |
    
    ### Whisper
    | Model | Description |
    |-------|-------------|
    | tiny | Fastest, lower accuracy |
    | **base** (default) | Good balance |
    | small | Better accuracy |
    | large-v3-turbo | Best quality, slower |
    
    ## Benchmark (24s audio)
    
    | Backend/Model | Time | RTF | Notes |
    |---------------|------|-----|-------|
    | Whisper Base int8 | 0.43s | 0.018x | Fastest |
    | **Parakeet v2 int8** | 0.60s | 0.025x | Best accuracy |
    | Parakeet v3 int8 | 0.63s | 0.026x | Multilingual |
    
    ## openclaw.json
    
    ```json
    {
      "tools": {
        "media": {
          "audio": {
            "enabled": true,
            "models": [
              {
                "type": "cli",
                "command": "~/.openclaw/skills/local-stt/scripts/local-stt.py",
                "args": ["--quiet", "{{MediaPath}}"],
                "timeoutSeconds": 30
              }
            ]
          }
        }
      }
    }
    ```