    pocket-tts

    By @sherajdev
    # Pocket TTS Skill
    
    Fully local, offline text-to-speech using Kyutai's Pocket TTS model. Once the model has been downloaded, it generates high-quality audio from text without any API calls or internet connection. It includes 8 built-in voices, supports voice cloning, and runs entirely on CPU.
    
    ## Features
    
    - 🎯 **Fully local** - No API calls, runs completely offline
    - 🚀 **CPU-only** - No GPU required, works on any computer
    - ⚡ **Fast generation** - roughly 2-6x faster than real time on CPU
    - 🎤 **8 built-in voices** - alba, marius, javert, jean, fantine, cosette, eponine, azelma
    - 🎭 **Voice cloning** - Clone any voice from a WAV sample
    - 🔊 **Low latency** - ~200 ms to the first audio chunk
    - 📚 **Simple Python API** - Easy integration into any project
    
    ## Installation
    
    ```bash
    # 1. Accept the model license on Hugging Face
    # https://huggingface.co/kyutai/pocket-tts
    
    # 2. Install the package
    pip install pocket-tts
    
    # Or use uv for automatic dependency management
    uvx pocket-tts generate "Hello world"
    ```
    
    ## Usage
    
    ### CLI
    
    ```bash
    # Basic usage
    pocket-tts "Hello, I am your AI assistant"
    
    # With specific voice
    pocket-tts "Hello" --voice alba --output hello.wav
    
    # With custom voice file (voice cloning)
    pocket-tts "Hello" --voice-file myvoice.wav --output output.wav
    
    # Adjust speed
    pocket-tts "Hello" --speed 1.2
    
    # Start local server
    pocket-tts --serve
    
    # List available voices
    pocket-tts --list-voices
    ```
    
    ### Python API
    
    ```python
    from pocket_tts import TTSModel
    import scipy.io.wavfile
    
    # Load model
    tts_model = TTSModel.load_model()
    
    # Get voice state
    voice_state = tts_model.get_state_for_audio_prompt(
        "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
    )
    
    # Generate audio
    audio = tts_model.generate_audio(voice_state, "Hello world!")
    
    # Save to WAV
    scipy.io.wavfile.write("output.wav", tts_model.sample_rate, audio.numpy())
    
    # Check sample rate
    print(f"Sample rate: {tts_model.sample_rate} Hz")
    ```
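
The Features list quotes generation at roughly 2-6x real time. One way to check that figure on your own machine is to compare the duration of the generated audio with the wall-clock time it took to produce. The helper below is a minimal sketch; `real_time_factor` is a hypothetical name and is not part of `pocket_tts`.

```python
def real_time_factor(num_samples: int, sample_rate: int, elapsed_s: float) -> float:
    """Return seconds of audio produced per second of wall-clock time."""
    if sample_rate <= 0 or elapsed_s <= 0:
        raise ValueError("sample_rate and elapsed_s must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / elapsed_s


# Usage with the objects from the example above (not executed here):
# import time
# start = time.perf_counter()
# audio = tts_model.generate_audio(voice_state, "Hello world!")
# elapsed = time.perf_counter() - start
# print(real_time_factor(audio.shape[-1], tts_model.sample_rate, elapsed))
```

A value above 1.0 means audio is generated faster than it plays back.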
    
    ## Available Voices
    
    | Voice | Description |
    |-------|-------------|
    | alba | Casual female voice |
    | marius | Male voice |
    | javert | Clear male voice |
    | jean | Natural male voice |
    | fantine | Female voice |
    | cosette | Female voice |
    | eponine | Female voice |
    | azelma | Female voice |
    
    Alternatively, pass `--voice-file /path/to/wav.wav` to clone a custom voice.
    
    ## Options
    
    | Option | Description | Default |
    |--------|-------------|---------|
    | `text` | Text to convert | Required |
    | `-o, --output` | Output WAV file | `output.wav` |
    | `-v, --voice` | Voice preset | `alba` |
    | `-s, --speed` | Speech speed (0.5-2.0) | `1.0` |
    | `--voice-file` | Custom WAV for cloning | None |
    | `--serve` | Start HTTP server | False |
    | `--list-voices` | List all voices | False |
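
To make the option table concrete, here is a minimal `argparse` sketch that mirrors the flags and defaults listed above. It is an illustration only, not the skill's actual CLI source; `build_parser`, `speed_in_range`, and `BUILTIN_VOICES` are hypothetical names, and making `text` optional is an assumption made so that `--serve` and `--list-voices` can run without it.

```python
import argparse

# Voice presets as listed in the Available Voices table.
BUILTIN_VOICES = ["alba", "marius", "javert", "jean", "fantine", "cosette", "eponine", "azelma"]


def speed_in_range(value: str) -> float:
    """Parse a speed value and reject anything outside 0.5-2.0."""
    speed = float(value)
    if not 0.5 <= speed <= 2.0:
        raise argparse.ArgumentTypeError("speed must be between 0.5 and 2.0")
    return speed


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults follow the option table."""
    parser = argparse.ArgumentParser(prog="pocket-tts")
    # Assumption: text is optional so --serve and --list-voices work alone.
    parser.add_argument("text", nargs="?", help="Text to convert")
    parser.add_argument("-o", "--output", default="output.wav", help="Output WAV file")
    parser.add_argument("-v", "--voice", default="alba", choices=BUILTIN_VOICES, help="Voice preset")
    parser.add_argument("-s", "--speed", type=speed_in_range, default=1.0, help="Speech speed (0.5-2.0)")
    parser.add_argument("--voice-file", default=None, help="Custom WAV for cloning")
    parser.add_argument("--serve", action="store_true", help="Start HTTP server")
    parser.add_argument("--list-voices", action="store_true", help="List all voices")
    return parser
```

Parsing `["Hello", "--voice", "marius", "-s", "1.2"]` with this parser yields the `marius` voice at speed 1.2, with every other option left at its table default.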
    
    ## Requirements
    
    - Python 3.10-3.14
    - PyTorch 2.5+ (CPU version works)
    - Works on 2 CPU cores
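
A quick preflight check for the requirements above can be sketched with the standard library alone. The function names are hypothetical, the check treats 2 cores as a minimum (an interpretation of the list above), and it does not inspect the installed PyTorch version.

```python
import os
import sys
from typing import Optional, Tuple


def python_supported(version: Tuple[int, int]) -> bool:
    """True if (major, minor) falls in the documented 3.10-3.14 range."""
    return (3, 10) <= tuple(version) <= (3, 14)


def enough_cores(cores: Optional[int], minimum: int = 2) -> bool:
    """True if at least `minimum` cores are reported (None means unknown)."""
    return cores is not None and cores >= minimum


if __name__ == "__main__":
    print("Python OK:", python_supported(sys.version_info[:2]))
    print("CPU cores OK:", enough_cores(os.cpu_count()))
```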
    
    ## Notes
    
    - ⚠️ The model is gated: accept the license on Hugging Face before first use
    - 🌍 English only (v1)
    - 💾 The first run downloads the model weights (~100M parameters)
    - 🔊 Audio is returned as a 1D torch tensor of PCM samples
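
Because the audio comes back as a 1D tensor of PCM samples, it can also be written to disk without SciPy. The sketch below uses only the standard library and assumes the samples are floats in the range [-1.0, 1.0]; that range is an assumption, so check the values your install actually returns. `float_to_pcm16` and `write_wav` are hypothetical helpers, not part of `pocket_tts`.

```python
import struct
import wave
from typing import Iterable


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Clip floats to [-1.0, 1.0] and pack them as little-endian 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_wav(path: str, samples: Iterable[float], sample_rate: int) -> None:
    """Write mono 16-bit PCM audio to `path`."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(float_to_pcm16(samples))


# Usage with the Python API example (not executed here):
# write_wav("output.wav", audio.tolist(), tts_model.sample_rate)
```

Clipping before scaling keeps out-of-range samples from overflowing the 16-bit range.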
    
    ## Links
    
    - [Demo](https://kyutai.org/tts)
    - [GitHub](https://github.com/kyutai-labs/pocket-tts)
    - [Hugging Face](https://huggingface.co/kyutai/pocket-tts)
    - [Paper](https://arxiv.org/abs/2509.06926)