    🦞

    audio-gen

    Generate audiobooks, podcasts, or educational audio content

    By @udiedrichsen
    SKILL.md
    ---
    name: audio-gen
    description: Generate audiobooks, podcasts, or educational audio content on demand. User provides an idea or topic, Claude AI writes a script, and ElevenLabs converts it to high-quality audio. Supports multiple formats (audiobook, podcast, educational), custom lengths, and voice effects. Use when asked to create audio content, make a podcast, generate an audiobook, or produce educational audio. Returns MP3 audio file via MEDIA token.
    homepage: https://github.com/clawdbot/clawdbot
    metadata: {"clawdbot":{"emoji":"🎙️","requires":{"skills":["sag"],"env":["ANTHROPIC_API_KEY","ELEVENLABS_API_KEY"]},"primaryEnv":"ANTHROPIC_API_KEY"}}
    ---
    
    # 🎙️ Audio Content Generator
    
    Generate high-quality audiobooks, podcasts, or educational audio content on demand using AI-written scripts and ElevenLabs text-to-speech.
    
    ## Quick Start
    
    **Create an audiobook chapter:**
    ```
    User: "Create a 5-minute audiobook chapter about a dragon discovering friendship"
    ```
    
    **Generate a podcast:**
    ```
    User: "Make a 10-minute podcast about the history of coffee"
    ```
    
    **Produce educational content:**
    ```
    User: "Generate a 15-minute educational audio explaining how neural networks work"
    ```
    
    ## Content Formats
    
    ### Audiobook
    **Style:** Narrative storytelling with emotional depth
    - Clear beginning, middle, and end
    - Descriptive language and vivid imagery
    - Dramatic pacing with thoughtful pauses
    - Emotional tone that matches the story
    - Use voice effects like `[whispers]`, `[excited]`, `[serious]` for impact
    
    **Example Structure:**
    ```
    [Opening hook - set the scene]
    [long pause]
    
    [Story development with character emotions]
    [short pause] between sentences
    [long pause] between paragraphs
    
    [Climax with dramatic tension]
    [long pause]
    
    [Resolution and emotional closure]
    ```
    
    ### Podcast
    **Style:** Conversational and engaging
    - Warm, welcoming intro (15-30 seconds)
    - Main content with natural flow
    - Transitions between topics
    - Memorable outro with key takeaways
    - Conversational tone throughout
    
    **Example Structure:**
    ```
    **Intro:** "Welcome to [topic]. I'm excited to share..."
    [short pause]
    
    **Main Content:** "Let's start with... [topic 1]"
    [long pause] between segments
    
    **Outro:** "Thanks for listening! Remember..."
    ```
    
    ### Educational Content
    **Style:** Clear explanations for learning
    - Simple introductions to complex topics
    - Step-by-step breakdowns
    - Real-world examples and analogies
    - Recap of key concepts at the end
    - Enthusiastic delivery with `[excited]` for important points
    
    **Example Structure:**
    ```
    **Introduction:** What is [topic] and why it matters?
    
    **Main Content:**
    - Concept 1: Explanation + Example
    - Concept 2: Explanation + Example
    - Concept 3: Explanation + Example
    
    **Summary:** Key takeaways and next steps
    ```
    
    ## Length Guidelines
    
    **Word Count to Duration Conversion:**
    - 5 minutes = ~375 words
    - 10 minutes = ~750 words
    - 15 minutes = ~1,125 words
    - 20 minutes = ~1,500 words
    - 30 minutes = ~2,250 words
    
    **Pacing:** All conversions above assume a delivery pace of ~75 words per minute
    
    **Practical Limits:**
    - Minimum: 2 minutes (~150 words)
    - Maximum: 30 minutes (~2,250 words)
    - Sweet spot: 5-15 minutes for best engagement
    
    ## Workflow Instructions
    
    ### Step 1: Understand the Request
    
    Parse the user's request for:
    1. **Content type** (audiobook, podcast, educational, or inferred from topic)
    2. **Topic/theme** (what should the content be about)
    3. **Target length** (how many minutes)
    4. **Tone/style** (dramatic, casual, educational, etc.)
    5. **Special requests** (specific voice, emphasis on certain points)
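
The five fields above can be captured in a small structure. The following is only a rough sketch: the `AudioRequest` class, the `parse_request` helper, the keyword matching, and the fallback format are all hypothetical illustrations, not part of the skill. In practice the model infers these fields from the conversation.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioRequest:
    """Hypothetical container for the fields parsed in Step 1."""
    content_type: str               # "audiobook", "podcast", or "educational"
    topic: str
    target_minutes: Optional[int]   # None when the user gave no length
    tone: Optional[str] = None
    special_requests: Optional[str] = None


# Assumption: the format is detected by simple keyword matching.
FORMAT_KEYWORDS = ("audiobook", "podcast", "educational")


def parse_request(text: str) -> AudioRequest:
    lowered = text.lower()
    # Assumed fallback: treat requests without a format keyword as educational.
    content_type = next((k for k in FORMAT_KEYWORDS if k in lowered), "educational")
    # Matches "10-minute", "10 minute", or "10 minutes".
    length = re.search(r"(\d+)[-\s]*minute", lowered)
    minutes = int(length.group(1)) if length else None
    # Use the text after "about" as the topic, else the whole request.
    about = re.search(r"\babout\s+(.+)", text, flags=re.IGNORECASE)
    topic = about.group(1).strip() if about else text.strip()
    return AudioRequest(content_type, topic, minutes)
```

When `target_minutes` comes back as `None`, asking the user for a length is likely safer than guessing one.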
    
    ### Step 2: Calculate Word Count
    
    ```
    target_words = target_minutes × 75
    ```
    
    Example: 10 minutes = 10 × 75 = 750 words
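
As a minimal sketch, the formula can be combined with the practical limits from Length Guidelines. The function name `target_words` and the choice to raise an error on out-of-range lengths are assumptions made for illustration.

```python
WORDS_PER_MINUTE = 75   # pacing figure used by this skill
MIN_MINUTES = 2         # practical limits from Length Guidelines
MAX_MINUTES = 30


def target_words(target_minutes: float) -> int:
    """Return the script word budget for a requested duration."""
    if not MIN_MINUTES <= target_minutes <= MAX_MINUTES:
        # Assumption: out-of-range lengths are rejected rather than clamped.
        raise ValueError(
            f"length must be between {MIN_MINUTES} and {MAX_MINUTES} minutes"
        )
    return round(target_minutes * WORDS_PER_MINUTE)
```

For example, `target_words(10)` gives 750, matching the worked example above.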
    
    ### Step 3: Generate the Script
    
    Write the complete script following these rules:
    
    **Content Guidelines:**
    - Start strong with an engaging hook
    - Maintain natural, conversational flow
    - Use active voice and simple sentence structure
    - Include relevant examples and stories
    - End with a satisfying conclusion
    
    **Formatting Rules:**
    - Add `[short pause]` after selected sentences (sparingly, not after every sentence)
    - Add `[long pause]` between paragraphs or major sections
    - Use voice effects strategically: `[whispers]`, `[shouts]`, `[excited]`, `[serious]`, `[sarcastic]`, `[sings]`, `[laughs]`
    - Write numbers as words: "twenty-three" not "23"
    - Spell out acronyms first time: "AI, or artificial intelligence"
    - Avoid complex punctuation (em-dashes work, but semicolons don't read well)
    - Remove markdown formatting before TTS conversion
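
Two of these rules are mechanical enough to check automatically. The sketch below flags leftover digits and semicolons. The `lint_script` helper and its warning wording are hypothetical, and it does not attempt to convert numbers to words.

```python
import re
from typing import List


def lint_script(script: str) -> List[str]:
    """Return warnings for formatting rules that can be checked mechanically."""
    warnings = []
    digits = re.findall(r"\d+", script)
    if digits:
        warnings.append(f"write numbers as words: {', '.join(digits)}")
    if ";" in script:
        warnings.append("semicolons do not read well: rephrase or split the sentence")
    return warnings
```

An empty list means neither issue was found. It does not mean the script follows every rule above.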
    
    ### Step 4: Present the Script
    
    Show the script to the user and ask:
    ```
    Here's the [format] script I've created (approximately [length] minutes):
    
    [Display the script]
    
    Would you like me to:
    1. Generate the audio now
    2. Make changes to the script
    3. Adjust the length or tone
    ```
    
    ### Step 5: Handle User Feedback
    
    If user requests changes:
    - Regenerate the script with adjustments
    - Maintain the target word count
    - Present the revised version
    
    If user approves:
    - Proceed to audio generation
    
    ### Step 6: Generate Audio
    
    **Format the script for TTS:**
    1. Remove any remaining markdown (headers, bold, italics)
    2. Ensure voice effects are in proper `[effect]` format
    3. Check that pauses are appropriately placed
    4. Verify numbers and acronyms are spelled out
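
The first item in this checklist could be handled with a few regular expressions. This is a rough sketch rather than a full markdown parser: the `strip_markdown` helper is hypothetical and only covers headers, bold, italics, and inline code. Bracketed effects such as `[long pause]` are left untouched.

```python
import re


def strip_markdown(script: str) -> str:
    """Remove common markdown markers while keeping [effect] tags intact."""
    text = re.sub(r"^#{1,6}[ \t]+", "", script, flags=re.MULTILINE)  # headers
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)                  # bold
    text = re.sub(r"(\*|_)(.+?)\1", r"\2", text)                     # italics
    text = re.sub(r"`([^`]*)`", r"\1", text)                         # inline code
    return text.strip()
```

Bold is removed before italics so that `**word**` is not misread as two nested italic markers.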
    
    **Invoke the TTS script:**
    
    **IMPORTANT:** The `ELEVENLABS_API_KEY` environment variable is already configured in the system. Simply invoke the TTS script directly.
    
    ```bash
    uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
      -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
      -m eleven_multilingual_v2 \
      "[formatted_script]"
    ```
    
    **For long scripts, use heredoc:**
    ```bash
    uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
      -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
      -m eleven_multilingual_v2 \
      "$(cat <<'EOF'
    [formatted_script]
    EOF
    )"
    ```
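
If the invocation is driven from Python instead of a shell, passing the script as a list element avoids shell quoting altogether. The sketch below reuses the script path and the `-o` and `-m` flags from the commands above. The helper names, the Unix-epoch timestamp, and the slug format are assumptions, since the source does not specify them.

```python
import re
import subprocess
import time
from typing import List, Optional

# Path copied from the shell commands above.
TTS_SCRIPT = "/home/clawdbot/clawdbot/skills/sag/scripts/tts.py"


def slugify(topic: str) -> str:
    """Assumed slug format: lowercase words joined by hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")


def build_tts_command(
    script_text: str,
    topic: str,
    model: str = "eleven_multilingual_v2",
    timestamp: Optional[int] = None,
) -> List[str]:
    # Assumption: the timestamp is Unix epoch seconds.
    ts = int(time.time()) if timestamp is None else timestamp
    output = f"/tmp/audio-gen-{ts}-{slugify(topic)}.mp3"
    return ["uv", "run", TTS_SCRIPT, "-o", output, "-m", model, script_text]


def generate_audio(script_text: str, topic: str) -> str:
    """Run the TTS script and return the MEDIA token for the output file."""
    cmd = build_tts_command(script_text, topic)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return f"MEDIA:{cmd[4]}"         # cmd[4] is the path passed to -o
```

Because no shell is involved, quotes and dollar signs in the script text reach `tts.py` unchanged.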
    
    **Return the result:**
    ```
    MEDIA:/tmp/audio-gen-[timestamp]-[topic-slug].mp3
    
    Your [format] is ready! [Brief description of content]. Duration: approximately [X] minutes.
    ```
    
    ## Voice Effects (Bracketed Tags)
    
    Available voice modulation effects (use sparingly for impact):
    
    - `[whispers]` - Soft, intimate delivery
    - `[shouts]` - Loud, emphatic delivery
    - `[excited]` - Enthusiastic, energetic tone
    - `[serious]` - Grave, solemn tone
    - `[sarcastic]` - Ironic, mocking tone
    - `[sings]` - Musical, melodic delivery
    - `[laughs]` - Amused, jovial tone
    - `[short pause]` - Brief silence (~0.5s)
    - `[long pause]` - Extended silence (~1-2s)
    
    **Best Practices:**
    - Use effects for emotional moments, not every sentence
    - Pauses are your most powerful tool for pacing
    - Voice effects work best in audiobooks and dramatic content
    - Keep podcasts and educational content mostly natural
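
A quick check for misspelled or unsupported effects might look like the sketch below. The tag set is copied from the list above. The `unknown_tags` helper is hypothetical, and it treats any bracketed text as a tag, including leftover template placeholders.

```python
import re
from typing import List

# The nine effects listed in this section.
ALLOWED_TAGS = {
    "whispers", "shouts", "excited", "serious", "sarcastic",
    "sings", "laughs", "short pause", "long pause",
}


def unknown_tags(script: str) -> List[str]:
    """Return bracketed tags that are not in the documented effect list."""
    found = re.findall(r"\[([^\[\]]+)\]", script)
    return [tag for tag in found if tag.strip().lower() not in ALLOWED_TAGS]
```

Running this before audio generation also catches placeholders such as `[topic]` that were never filled in.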
    
    ## Error Handling
    
    ### Script Too Long
    If the generated script exceeds the target word count by more than 20%:
    ```
    The script I generated is [X] words ([Y] minutes), which is longer than your target of [Z] minutes. Would you like me to:
    1. Condense it to fit the target length
    2. Split it into multiple parts
    3. Keep it as is
    ```
    
    ### Script Too Short
    If the generated script falls short of the target word count by more than 20%:
    ```
    The script is [X] words ([Y] minutes), shorter than your target. Would you like me to:
    1. Expand it with more detail
    2. Add additional examples or stories
    3. Generate as is
    ```
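
The 20% threshold used by both length checks can be sketched as follows. Excluding bracketed tags from the word count is an assumption, and the function names are hypothetical.

```python
import re

WORDS_PER_MINUTE = 75   # pacing figure used by this skill
TOLERANCE = 0.20        # deviation that triggers the prompts above


def count_spoken_words(script: str) -> int:
    # Assumption: bracketed tags like [long pause] are not spoken words.
    return len(re.sub(r"\[[^\[\]]*\]", " ", script).split())


def length_status(script: str, target_minutes: float) -> str:
    """Classify a script as 'too_long', 'too_short', or 'ok'."""
    target = target_minutes * WORDS_PER_MINUTE
    words = count_spoken_words(script)
    if words > target * (1 + TOLERANCE):
        return "too_long"
    if words < target * (1 - TOLERANCE):
        return "too_short"
    return "ok"
```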
    
    ### TTS Generation Fails
    If the TTS script fails:
    ```
    I've created the script, but I'm unable to generate the audio right now. Here's your script:
    
    [Display script]
    
    Error: [specific error message]
    
    You can:
    1. Check that ELEVENLABS_API_KEY is configured
    2. Use the script with your own text-to-speech tool
    3. Try again in a moment
    4. Ask me to troubleshoot the audio generation
    ```
    
    **Common TTS Issues:**
    - API key not set: Verify ELEVENLABS_API_KEY in config
    - Rate limit: Wait a moment and try again
    - Text too long: Break into smaller chunks (max ~5000 characters)
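
One way to break a long script into chunks is to split on sentence boundaries and pack sentences greedily. This is a minimal sketch: the `chunk_script` helper is hypothetical, whitespace between sentences is normalized to a single space, and joining the resulting audio files is not covered here.

```python
import re
from typing import List

MAX_CHARS = 5000  # approximate limit quoted above


def chunk_script(script: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Split after sentence-ending punctuation, keeping it with the sentence.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        # A single sentence longer than max_chars is kept whole, not cut.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at sentence ends keeps each chunk natural to read aloud, at the cost of chunks that may fall well below the limit.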
    
    ### Invalid Request
    For unrealistic requests (e.g., "100-hour audiobook"):
    ```
    That length would require [X] words and take significant time to generate. I recommend:
    - Breaking it into multiple episodes/chapters
    - Targeting 5-30 minutes per audio file
    - Creating a series instead of one long file
    ```
    
    ## Tips for Best Results
    
    ### For Engaging Audiobooks
    - Focus on character emotions and sensory details
    - Use pauses to build dramatic tension
    - Vary sentence length for rhythm
    - Include internal monologue and reflection
    
    ### For Compelling Podcasts
    - Start with a question or surprising fact
    - Use conversational phrases: "You know what's interesting..."
    - Include relatable examples from everyday life
    - End with actionable takeaways
    
    ### For Effective Educational Content
    - Use the "explain like I'm five" approach
    - Build from simple to complex concepts
    - Repeat key terms and definitions
    - Provide multiple examples for clarity
    
    ## Technical Notes
    
    **TTS Implementation:**
    - Uses Python script: `/home/clawdbot/clawdbot/skills/sag/scripts/tts.py` (the path invoked in Step 6)
    - No binary installation required (pure Python + requests)
    - Directly calls ElevenLabs API
    - Compatible with Linux and macOS
    
    **File Storage:**
    - Audio files are saved to `/tmp/`
    - Filename format: `audio-gen-[timestamp]-[topic-slug].mp3`
    - Files are automatically cleaned up after 24 hours
    
    **API Requirements:**
    - Anthropic API for script generation (already configured)
    - ElevenLabs API for text-to-speech (configured via ELEVENLABS_API_KEY)
    - Both services must be configured and have available credits
    
    **Supported Models:**
    - `eleven_multilingual_v2` - Best quality (default)
    - `eleven_turbo_v2` - Faster generation
    - `eleven_turbo_v2_5` - Fastest generation
    - `eleven_multilingual_v1` - Legacy model
    
    **Cost Estimate:**
    - 10-minute audio (~750 words): approximately $1.43
      - Claude API: ~$0.075
      - ElevenLabs: ~$1.35
    - Longer content scales proportionally
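
Proportional scaling from the 10-minute reference figures can be sketched as below. The per-service amounts come from the estimate above and are not current price quotes. The `estimate_cost` helper is hypothetical.

```python
from typing import Dict

REFERENCE_WORDS = 750     # 10-minute script
CLAUDE_COST = 0.075       # USD, from the estimate above
ELEVENLABS_COST = 1.35    # USD, from the estimate above


def estimate_cost(words: int) -> Dict[str, float]:
    """Scale the reference cost linearly with script length."""
    scale = words / REFERENCE_WORDS
    claude = CLAUDE_COST * scale
    elevenlabs = ELEVENLABS_COST * scale
    return {"claude": claude, "elevenlabs": elevenlabs, "total": claude + elevenlabs}
```

Under this assumption a 20-minute script (~1,500 words) would cost about twice the reference estimate.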
    
    **Generation Time:**
    - Script generation: 5-30 seconds (depending on length)
    - Audio generation: 5-15 seconds (ElevenLabs processing)
    - Total: Usually under 1 minute for 10-minute audio
    
    ## Limitations
    
    1. **Maximum Length:** 30 mi
    
    ... (truncated)