🦞
audio-reply

Generate audio replies using TTS.
SKILL.md
---
name: audio-reply
description: 'Generate audio replies using TTS. Trigger with "read it to me [URL]" to fetch and read content aloud, or "talk to me [topic]" to generate a spoken response. Also responds to "speak", "say it", "voice reply".'
homepage: https://github.com/anthropics/claude-code
metadata: {"clawdbot":{"emoji":"🔊","requires":{"bins":["uv"]}}}
---

# Audio Reply Skill

Generate spoken audio responses using MLX Audio TTS (chatterbox-turbo model).

## Trigger Phrases

- **"read it to me [URL]"** - Fetch content from URL and read it aloud
- **"talk to me [topic/question]"** - Generate a conversational response as audio
- **"speak"**, **"say it"**, **"voice reply"** - Convert your response to audio

## How to Use

### Mode 1: Read URL Content
```
User: read it to me https://example.com/article
```
1. Fetch the URL content using WebFetch
2. Extract readable text (strip HTML, focus on main content)
3. Generate audio using TTS
4. Play the audio and delete the file afterward

### Mode 2: Conversational Audio Response
```
User: talk to me about the weather today
```
1. Generate a natural, conversational response
2. Keep it concise (TTS works best with shorter segments)
3. Convert to audio, play it, then delete the file

## Implementation

### TTS Command
```bash
uv run mlx_audio.tts.generate \
  --model mlx-community/chatterbox-turbo-fp16 \
  --text "Your text here" \
  --play \
  --file_prefix /tmp/audio_reply
```

### Key Parameters
- `--model mlx-community/chatterbox-turbo-fp16` - Fast, natural voice
- `--play` - Auto-play the generated audio
- `--file_prefix` - Save to temp location for cleanup
- `--exaggeration 0.3` - Optional: add expressiveness (0.0-1.0)
- `--speed 1.0` - Adjust speech rate if needed

### Text Preparation Guidelines

**For "read it to me" mode:**
1. Fetch URL with WebFetch tool
2. Extract main content, strip navigation/ads/boilerplate
3. Summarize if very long (>500 words) - keep key points
4. Add natural pauses with periods and commas

**For "talk to me" mode:**
1. Write conversationally, as if speaking
2. Use contractions (I'm, you're, it's)
3. Add filler words sparingly for naturalness ([chuckle], um, anyway)
4. Keep responses under 200 words for best quality
5. Avoid technical jargon unless explaining it

### Audio Generation & Cleanup (IMPORTANT)

Always delete the audio file after playing - it's already in the chat history.

```bash
# Generate with unique filename and play
OUTPUT_FILE="/tmp/audio_reply_$(date +%s)"
uv run mlx_audio.tts.generate \
  --model mlx-community/chatterbox-turbo-fp16 \
  --text "Your response text" \
  --play \
  --file_prefix "$OUTPUT_FILE"

# ALWAYS clean up after playing
rm -f "${OUTPUT_FILE}"*.wav 2>/dev/null
```

### Error Handling

If TTS fails:
1. Check if model is downloaded (first run downloads ~500MB)
2. Ensure `uv` is installed and in PATH
3. Fall back to text response with apology

## Example Workflows

### Example 1: Read URL
```
User: read it to me https://blog.example.com/new-feature

Assistant actions:
1. WebFetch the URL
2. Extract article content
3. Generate TTS:
   uv run mlx_audio.tts.generate \
     --model mlx-community/chatterbox-turbo-fp16 \
     --text "Here's what I found... [article summary]" \
     --play --file_prefix /tmp/audio_reply_1706123456
4. Delete: rm -f /tmp/audio_reply_1706123456*.wav
5. Confirm: "Done reading the article to you."
```

### Example 2: Talk to Me
```
User: talk to me about what you can help with

Assistant actions:
1. Generate conversational response text
2. Generate TTS:
   uv run mlx_audio.tts.generate \
     --model mlx-community/chatterbox-turbo-fp16 \
     --text "Hey! So I can help you with all kinds of things..." \
     --play --file_prefix /tmp/audio_reply_1706123789
3. Delete: rm -f /tmp/audio_reply_1706123789*.wav
4. (No text output needed - audio IS the response)
```

## Notes

- First run may take longer as the model downloads (~500MB)
- Audio quality is best for English; other languages may vary
- For long content, consider chunking into multiple audio segments
- The `--play` flag uses system audio - ensure volume is up