Back to Skills
    🦞

    gemini-stt

    Transcribe audio files using Google's Gemini API or Vertex AI.

    By @araa47
    View on GitHub
    SKILL.md
    ---
    name: gemini-stt
    description: Transcribe audio files using Google's Gemini API or Vertex AI
    metadata: {"clawdbot":{"emoji":"🎤","os":["linux","darwin"]}}
    ---
    
    # Gemini Speech-to-Text Skill
    
    Transcribe audio files using Google's Gemini API or Vertex AI. Default model is `gemini-2.0-flash-lite` for fastest transcription.
    
    ## Authentication (choose one)
    
    ### Option 1: Vertex AI with Application Default Credentials (Recommended)
    
    ```bash
    gcloud auth application-default login
    gcloud config set project YOUR_PROJECT_ID
    ```
    
    The script will automatically detect and use ADC when available.
    
    ### Option 2: Direct Gemini API Key
    
    Set `GEMINI_API_KEY` in environment (e.g., `~/.env` or `~/.clawdbot/.env`)
    
    ## Requirements
    
    - Python 3.10+ (no external dependencies)
    - Either GEMINI_API_KEY or gcloud CLI with ADC configured
    
    ## Supported Formats
    
    - `.ogg` / `.opus` (Telegram voice messages)
    - `.mp3`
    - `.wav`
    - `.m4a`
    
    ## Usage
    
    ```bash
    # Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
    python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg
    
    # Force Vertex AI
    python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex
    
    # With a specific model
    python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro
    
    # Vertex AI with specific project and region
    python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1
    
    # With Clawdbot media
    python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg
    ```
    
    ## Options
    
    | Option | Description |
    |--------|-------------|
    | `<audio_file>` | Path to the audio file (required) |
    | `--model`, `-m` | Gemini model to use (default: `gemini-2.0-flash-lite`) |
    | `--vertex`, `-v` | Force use of Vertex AI with ADC |
    | `--project`, `-p` | GCP project ID (for Vertex, defaults to gcloud config) |
    | `--region`, `-r` | GCP region (for Vertex, default: `us-central1`) |
    
    ## Supported Models
    
    Any Gemini model that supports audio input can be used. Recommended models:
    
    | Model | Notes |
    |-------|-------|
    | `gemini-2.0-flash-lite` | **Default.** Fastest transcription speed. |
    | `gemini-2.0-flash` | Fast and cost-effective. |
    | `gemini-2.5-flash-lite` | Lightweight 2.5 model. |
    | `gemini-2.5-flash` | Balanced speed and quality. |
    | `gemini-2.5-pro` | Higher quality, slower. |
    | `gemini-3-flash-preview` | Latest flash model. |
    | `gemini-3-pro-preview` | Latest pro model, best quality. |
    
    See [Gemini API Models](https://ai.google.dev/gemini-api/docs/models) for the latest list.
    
    ## How It Works
    
    1. Reads the audio file and base64 encodes it
    2. Auto-detects authentication:
       - If ADC is available (gcloud), uses Vertex AI endpoint
       - Otherwise, uses GEMINI_API_KEY with direct Gemini API
    3. Sends to the selected Gemini model with transcription prompt
    4. Returns the transcribed text
    
    ## Example Integration
    
    For Clawdbot voice message handling:
    
    ```bash
    # Transcribe incoming voice message
    TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
    echo "User said: $TRANSCRIPT"
    ```
    
    ## Error Handling
    
    The script exits with code 1 and prints to stderr on:
    - No authentication available (neither ADC nor GEMINI_API_KEY)
    - File not found
    - API errors
    - Missing GCP project (when using Vertex)
    
    ## Notes
    
    - Uses Gemini 2.0 Flash Lite by default for fastest transcription
    - No external Python dependencies (uses stdlib only)
    - Automatically detects MIME type from file extension
    - Prefers Vertex AI with ADC when available (no API key management needed)