---
name: audio-gen
description: Generate audiobooks, podcasts, or educational audio content on demand. User provides an idea or topic, Claude AI writes a script, and ElevenLabs converts it to high-quality audio. Supports multiple formats (audiobook, podcast, educational), custom lengths, and voice effects. Use when asked to create audio content, make a podcast, generate an audiobook, or produce educational audio. Returns MP3 audio file via MEDIA token.
homepage: https://github.com/clawdbot/clawdbot
metadata: {"clawdbot":{"emoji":"🎙️","requires":{"skills":["sag"],"env":["ANTHROPIC_API_KEY","ELEVENLABS_API_KEY"]},"primaryEnv":"ANTHROPIC_API_KEY"}}
---

# 🎙️ Audio Content Generator

Generate high-quality audiobooks, podcasts, or educational audio content on demand using AI-written scripts and ElevenLabs text-to-speech.

## Quick Start

**Create an audiobook chapter:**
```
User: "Create a 5-minute audiobook chapter about a dragon discovering friendship"
```

**Generate a podcast:**
```
User: "Make a 10-minute podcast about the history of coffee"
```

**Produce educational content:**
```
User: "Generate a 15-minute educational audio explaining how neural networks work"
```

## Content Formats

### Audiobook
**Style:** Narrative storytelling with emotional depth
- Clear beginning, middle, and end
- Descriptive language and vivid imagery
- Dramatic pacing with thoughtful pauses
- Emotional tone that matches the story
- Use voice effects like `[whispers]`, `[excited]`, `[serious]` for impact

**Example Structure:**
```
[Opening hook - set the scene]
[long pause]

[Story development with character emotions]
[short pause] between sentences
[long pause] between paragraphs

[Climax with dramatic tension]
[long pause]

[Resolution and emotional closure]
```

### Podcast
**Style:** Conversational and engaging
- Warm, welcoming intro (15-30 seconds)
- Main content with natural flow
- Transitions between topics
- Memorable outro with key takeaways
- Conversational tone throughout

**Example Structure:**
```
**Intro:** "Welcome to [topic]. I'm excited to share..."
[short pause]

**Main Content:** "Let's start with... [topic 1]"
[long pause] between segments

**Outro:** "Thanks for listening! Remember..."
```

### Educational Content
**Style:** Clear explanations for learning
- Simple introductions to complex topics
- Step-by-step breakdowns
- Real-world examples and analogies
- Recap of key concepts at the end
- Enthusiastic delivery with `[excited]` for important points

**Example Structure:**
```
**Introduction:** What is [topic] and why it matters?

**Main Content:**
- Concept 1: Explanation + Example
- Concept 2: Explanation + Example
- Concept 3: Explanation + Example

**Summary:** Key takeaways and next steps
```

## Length Guidelines

**Word Count to Duration Conversion:**
- 5 minutes = ~375 words
- 10 minutes = ~750 words
- 15 minutes = ~1,125 words
- 20 minutes = ~1,500 words
- 30 minutes = ~2,250 words

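**Pacing:** The conversions above assume ~75 words per minute of finished audio. Typical conversational speech is faster (roughly 150 words per minute), so compare the duration of the generated audio against the target.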

**Practical Limits:**
- Minimum: 2 minutes (~150 words)
- Maximum: 30 minutes (~2,250 words)
- Sweet spot: 5-15 minutes for best engagement

## Workflow Instructions

### Step 1: Understand the Request

Parse the user's request for:
1. **Content type** (audiobook, podcast, or educational; infer it from the topic if unstated)
2. **Topic/theme** (what should the content be about)
3. **Target length** (how many minutes)
4. **Tone/style** (dramatic, casual, educational, etc.)
5. **Special requests** (specific voice, emphasis on certain points)
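
As a rough illustration of the parsing step, the sketch below pulls the content type and target length out of a free-form request. The function name `parse_request`, the keyword table, and the 5-minute default are assumptions made for this example, not part of the skill. Tone and special requests are left to the model's own reading of the text.

```python
import re

# Hypothetical keyword table: maps a word in the request to a content format.
FORMAT_KEYWORDS = {
    "audiobook": "audiobook",
    "podcast": "podcast",
    "educational": "educational",
}


def parse_request(text: str, default_minutes: int = 5) -> dict:
    """Extract content type and target minutes from a free-form request.

    default_minutes is an assumed fallback; the skill does not define one.
    """
    lowered = text.lower()
    content_type = next(
        (fmt for keyword, fmt in FORMAT_KEYWORDS.items() if keyword in lowered),
        None,  # None means "infer the format from the topic"
    )
    # Matches "10 minutes", "10-minute", "10 min", and similar.
    match = re.search(r"(\d+)[\s-]*(?:minutes?|mins?)\b", lowered)
    minutes = int(match.group(1)) if match else default_minutes
    return {"content_type": content_type, "minutes": minutes}


request = "Make a 10-minute podcast about the history of coffee"
print(parse_request(request))
```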

### Step 2: Calculate Word Count

```
target_words = target_minutes × 75
```

Example: 10 minutes = 10 × 75 = 750 words
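
The same calculation as a small Python sketch, with the 2 to 30 minute practical limits from the Length Guidelines applied as a range check. The function name `target_words` and the choice to raise an error (rather than clamp) are assumptions for illustration.

```python
WORDS_PER_MINUTE = 75          # rate used throughout this document
MIN_MINUTES, MAX_MINUTES = 2, 30  # practical limits from the Length Guidelines


def target_words(minutes: float) -> int:
    """Convert a target duration in minutes to a target word count."""
    if not MIN_MINUTES <= minutes <= MAX_MINUTES:
        raise ValueError(
            f"{minutes} minutes is outside the {MIN_MINUTES}-{MAX_MINUTES} minute range"
        )
    return round(minutes * WORDS_PER_MINUTE)


print(target_words(10))  # 750
```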

### Step 3: Generate the Script

Write the complete script following these rules:

**Content Guidelines:**
- Start strong with an engaging hook
- Maintain natural, conversational flow
- Use active voice and simple sentence structure
- Include relevant examples and stories
- End with a satisfying conclusion

**Formatting Rules:**
- Add `[short pause]` after selected sentences (sparingly, not after every sentence)
- Add `[long pause]` between paragraphs or major sections
- Use voice effects strategically: `[whispers]`, `[shouts]`, `[excited]`, `[serious]`, `[sarcastic]`, `[sings]`, `[laughs]`
- Write numbers as words: "twenty-three" not "23"
- Spell out acronyms first time: "AI, or artificial intelligence"
- Avoid complex punctuation (em-dashes work, but semicolons don't read well)
- Remove markdown formatting before TTS conversion

### Step 4: Present the Script

Show the script to the user and ask:
```
Here's the [format] script I've created (approximately [length] minutes):

[Display the script]

Would you like me to:
1. Generate the audio now
2. Make changes to the script
3. Adjust the length or tone
```

### Step 5: Handle User Feedback

If user requests changes:
- Regenerate the script with adjustments
- Maintain the target word count
- Present the revised version

If user approves:
- Proceed to audio generation

### Step 6: Generate Audio

**Format the script for TTS:**
1. Remove any remaining markdown (headers, bold, italics)
2. Ensure voice effects are in proper `[effect]` format
3. Check that pauses are appropriately placed
4. Verify numbers and acronyms are spelled out
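
The markdown-removal step could look something like the following. This is a minimal sketch that only handles headers, bold, and italics. The helper name `strip_markdown` is hypothetical, and bracketed effects such as `[long pause]` are deliberately left untouched.

```python
import re


def strip_markdown(script: str) -> str:
    """Remove headers, bold, and italic markers; keep [effect] tags as-is."""
    # Leading "#" header markers at the start of a line.
    text = re.sub(r"^#{1,6}[ \t]+", "", script, flags=re.MULTILINE)
    # Bold: **text** or __text__ (must run before the italic pass).
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    # Italic: *text* or _text_.
    text = re.sub(r"(\*|_)(.+?)\1", r"\2", text)
    return text


raw = "# Coffee\n**Intro:** Welcome to *coffee*. [short pause]"
print(strip_markdown(raw))
```

Scripts containing literal underscores or asterisks (for example, identifiers read aloud) would need a more careful pass than this.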

**Invoke the TTS script:**

**IMPORTANT:** The `ELEVENLABS_API_KEY` environment variable is already configured in the system. Simply invoke the TTS script directly.

```bash
uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
  -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
  -m eleven_multilingual_v2 \
  "[formatted_script]"
```

**For long scripts, use heredoc:**
```bash
uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
  -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
  -m eleven_multilingual_v2 \
  "$(cat <<'EOF'
[formatted_script]
EOF
)"
```
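
If the invocation is driven from Python instead of a shell, passing the script as a list argument sidesteps quoting problems entirely. The sketch below reuses the script path, the `-o` and `-m` flags, and the filename pattern from the commands above. The helper names, the Unix-epoch timestamp format, and the slug rules are assumptions, since the skill does not specify them.

```python
import re
import subprocess
import time

# Path and flags are copied from the shell commands above.
TTS_SCRIPT = "/home/clawdbot/clawdbot/skills/sag/scripts/tts.py"


def slugify(topic: str, max_len: int = 40) -> str:
    """Hypothetical slug rule: lowercase, non-alphanumerics become hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return slug[:max_len].rstrip("-")


def build_tts_command(script_text, topic, timestamp=None,
                      model="eleven_multilingual_v2"):
    """Return (argv list, output path) for one TTS run."""
    # Assumption: [timestamp] is a Unix epoch in seconds.
    ts = int(time.time()) if timestamp is None else timestamp
    output = f"/tmp/audio-gen-{ts}-{slugify(topic)}.mp3"
    cmd = ["uv", "run", TTS_SCRIPT, "-o", output, "-m", model, script_text]
    return cmd, output


def generate_audio(script_text, topic):
    """Run the TTS script; raises CalledProcessError if it exits non-zero."""
    cmd, output = build_tts_command(script_text, topic)
    subprocess.run(cmd, check=True)  # no shell, so no quoting issues
    return output
```

Here `generate_audio(script, "History of Coffee")` would return the path to place after the `MEDIA:` token.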

**Return the result:**
```
MEDIA:/tmp/audio-gen-[timestamp]-[topic-slug].mp3

Your [format] is ready! [Brief description of content]. Duration: approximately [X] minutes.
```

## Voice Effects (Bracketed Tags)

Available voice modulation effects (use sparingly for impact):

- `[whispers]` - Soft, intimate delivery
- `[shouts]` - Loud, emphatic delivery
- `[excited]` - Enthusiastic, energetic tone
- `[serious]` - Grave, solemn tone
- `[sarcastic]` - Ironic, mocking tone
- `[sings]` - Musical, melodic delivery
- `[laughs]` - Amused, jovial tone
- `[short pause]` - Brief silence (~0.5s)
- `[long pause]` - Extended silence (~1-2s)

**Best Practices:**
- Use effects for emotional moments, not every sentence
- Pauses are your most powerful tool for pacing
- Voice effects work best in audiobooks and dramatic content
- Keep podcasts and educational content mostly natural
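
Before sending a script to TTS, it may help to check that every bracketed tag is one of the effects listed above, since an unrecognized tag could be read aloud. A minimal sketch, assuming the list in this section is complete. The names `KNOWN_TAGS` and `unknown_tags` are hypothetical.

```python
import re

# The effects listed in this section.
KNOWN_TAGS = {
    "whispers", "shouts", "excited", "serious", "sarcastic",
    "sings", "laughs", "short pause", "long pause",
}


def unknown_tags(script: str) -> list:
    """Return bracketed tags in the script that are not in KNOWN_TAGS."""
    found = re.findall(r"\[([^\[\]]+)\]", script)
    return sorted({tag for tag in found if tag.strip().lower() not in KNOWN_TAGS})


sample = "[whispers] Hello there. [long pause] [giggles] Goodbye."
print(unknown_tags(sample))  # ['giggles']
```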

## Error Handling

### Script Too Long
If the generated script exceeds the target word count by more than 20%:
```
The script I generated is [X] words ([Y] minutes), which is longer than your target of [Z] minutes. Would you like me to:
1. Condense it to fit the target length
2. Split it into multiple parts
3. Keep it as is
```

### Script Too Short
If the generated script falls short of the target word count by more than 20%:
```
The script is [X] words ([Y] minutes), shorter than your target. Would you like me to:
1. Expand it with more detail
2. Add additional examples or stories
3. Generate as is
```
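
The two length checks above can be sketched as one function. This assumes the 75 words-per-minute rate from Step 2 and excludes bracketed tags from the count, which is a choice made for this example. The name `check_length` and its return values are hypothetical.

```python
import re


def check_length(script: str, target_minutes: float,
                 wpm: int = 75, tolerance: float = 0.20) -> str:
    """Classify a script as 'too_long', 'too_short', or 'ok' against the target."""
    # Drop [effect] and [pause] tags so they are not counted as spoken words.
    spoken = re.sub(r"\[[^\[\]]+\]", " ", script)
    words = len(spoken.split())
    target = target_minutes * wpm
    if words > target * (1 + tolerance):
        return "too_long"
    if words < target * (1 - tolerance):
        return "too_short"
    return "ok"


draft = "word " * 1000
print(check_length(draft, target_minutes=10))  # too_long
```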

### TTS Generation Fails
If the TTS script fails:
```
I've created the script, but I'm unable to generate the audio right now. Here's your script:

[Display script]

Error: [specific error message]

You can:
1. Check that ELEVENLABS_API_KEY is configured
2. Use the script with your own text-to-speech tool
3. Try again in a moment
4. Ask me to troubleshoot the audio generation
```

**Common TTS Issues:**
- API key not set: Verify ELEVENLABS_API_KEY in config
- Rate limit: Wait a moment and try again
- Text too long: Break into smaller chunks (max ~5000 characters)
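
One way to break a long script into smaller chunks is to split on sentence boundaries and pack sentences greedily up to the character limit. The sketch below assumes the ~5000 character figure above. The function name `chunk_script` is hypothetical, and joining the resulting audio files is outside its scope.

```python
import re


def chunk_script(text: str, max_chars: int = 5000) -> list:
    """Split text into chunks of at most max_chars, breaking between sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


long_script = "One two three. " * 1000
parts = chunk_script(long_script)
print(len(parts), max(len(p) for p in parts))
```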

### Invalid Request
For unrealistic requests (e.g., "100-hour audiobook"):
```
That length would require [X] words and take significant time to generate. I recommend:
- Breaking it into multiple episodes/chapters
- Targeting 5-30 minutes per audio file
- Creating a series instead of one long file
```

## Tips for Best Results

### For Engaging Audiobooks
- Focus on character emotions and sensory details
- Use pauses to build dramatic tension
- Vary sentence length for rhythm
- Include internal monologue and reflection

### For Compelling Podcasts
- Start with a question or surprising fact
- Use conversational phrases: "You know what's interesting..."
- Include relatable examples from everyday life
- End with actionable takeaways

### For Effective Educational Content
- Use the "explain like I'm five" approach
- Build from simple to complex concepts
- Repeat key terms and definitions
- Provide multiple examples for clarity

## Technical Notes

**TTS Implementation:**
- Uses Python script: `/home/clawdbot/clawdbot/skills/sag/scripts/tts.py`
- No binary installation required (pure Python + requests)
- Directly calls ElevenLabs API
- Compatible with Linux and macOS

**File Storage:**
- Audio files are saved to `/tmp/`
- Filename format: `audio-gen-[timestamp]-[topic-slug].mp3`
- Files are automatically cleaned up after 24 hours

**API Requirements:**
- Anthropic API for script generation (already configured)
- ElevenLabs API for text-to-speech (configured via ELEVENLABS_API_KEY)
- Both services must be configured and have available credits

**Supported Models:**
- `eleven_multilingual_v2` - Best quality (default)
- `eleven_turbo_v2` - Faster generation
- `eleven_turbo_v2_5` - Fastest generation
- `eleven_multilingual_v1` - Legacy model

**Cost Estimate:**
- 10-minute audio (~750 words): approximately $1.43
  - Claude API: ~$0.075
  - ElevenLabs: ~$1.35
- Longer content scales proportionally
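
As a worked example of the proportional scaling, the sketch below takes the 10-minute figures above as a baseline and scales them linearly. The linear model and the function name `estimate_cost` are assumptions. Actual pricing depends on the plans and models in use.

```python
# Baseline figures for 10 minutes of audio, taken from the estimate above.
CLAUDE_COST_PER_10_MIN = 0.075
ELEVENLABS_COST_PER_10_MIN = 1.35


def estimate_cost(minutes: float) -> dict:
    """Scale the 10-minute baseline linearly to the requested duration (USD)."""
    scale = minutes / 10
    claude = CLAUDE_COST_PER_10_MIN * scale
    elevenlabs = ELEVENLABS_COST_PER_10_MIN * scale
    return {
        "claude": round(claude, 3),
        "elevenlabs": round(elevenlabs, 3),
        "total": round(claude + elevenlabs, 3),
    }


print(estimate_cost(20))
```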

**Generation Time:**
- Script generation: 5-30 seconds (depending on length)
- Audio generation: 5-15 seconds (ElevenLabs processing)
- Total: usually under 1 minute for a 10-minute audio file

## Limitations

1. **Maximum Length:** 30 mi

... (truncated)