    podcast-generation

    Generate AI-powered podcast-style audio narratives

    By @thegovind
    SKILL.md
    ---
    name: podcast-generation
    description: Generate AI-powered podcast-style audio narratives using Azure OpenAI's GPT Realtime Mini model via WebSocket. Use when building text-to-speech features, audio narrative generation, podcast creation from content, or integrating with Azure OpenAI Realtime API for real audio output. Covers full-stack implementation from React frontend to Python FastAPI backend with WebSocket streaming.
    ---
    
    # Podcast Generation with GPT Realtime Mini
    
    Generate real audio narratives from text content using Azure OpenAI's Realtime API.
    
    ## Quick Start
    
    1. Configure environment variables for the Realtime API
    2. Connect to the Azure OpenAI Realtime endpoint over WebSocket
    3. Send the text prompt and collect PCM audio chunks plus the transcript
    4. Convert the raw PCM to WAV format
    5. Return base64-encoded WAV audio to the frontend for playback
    
    ## Environment Configuration
    
    ```env
    AZURE_OPENAI_AUDIO_API_KEY=your_realtime_api_key
    AZURE_OPENAI_AUDIO_ENDPOINT=https://your-resource.cognitiveservices.azure.com
    AZURE_OPENAI_AUDIO_DEPLOYMENT=gpt-realtime-mini
    ```
    
    **Note**: The endpoint must NOT include `/openai/v1/`; supply only the base URL. The backend appends `/openai/v1` when it builds the WebSocket URL.
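
The endpoint rule above can also be checked in code before connecting. The following is a minimal sketch, not part of the skill itself: `build_ws_url` is a hypothetical helper name, and stripping a trailing `/openai/v1` from a misconfigured value is an assumed convenience rather than documented behavior.

```python
def build_ws_url(endpoint: str) -> str:
    """Turn an HTTPS base endpoint into the Realtime WebSocket base URL."""
    base = endpoint.strip().rstrip("/")
    # Tolerate a value that already ends in /openai/v1 (assumed convenience)
    if base.endswith("/openai/v1"):
        base = base[: -len("/openai/v1")]
    if not base.startswith("https://"):
        raise ValueError("endpoint must start with https://")
    # Swap the scheme and append the path the Realtime client expects
    return "wss://" + base[len("https://"):] + "/openai/v1"
```

Passing `https://your-resource.cognitiveservices.azure.com` yields `wss://your-resource.cognitiveservices.azure.com/openai/v1`, which matches the URL built in the backend code.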
    
    ## Core Workflow
    
    ### Backend Audio Generation
    
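    ```python
    import base64
    import os

    from openai import AsyncOpenAI

    # Assumes scripts/ is importable from the backend; adjust the path to your layout
    from scripts.pcm_to_wav import pcm_to_wav


    async def generate_audio(prompt: str) -> tuple[bytes, str]:
        """Narrate `prompt` and return (WAV bytes, transcript)."""
        endpoint = os.environ["AZURE_OPENAI_AUDIO_ENDPOINT"].rstrip("/")
        api_key = os.environ["AZURE_OPENAI_AUDIO_API_KEY"]
        deployment = os.environ.get("AZURE_OPENAI_AUDIO_DEPLOYMENT", "gpt-realtime-mini")

        # Convert HTTPS endpoint to WebSocket URL
        ws_url = endpoint.replace("https://", "wss://") + "/openai/v1"

        client = AsyncOpenAI(
            websocket_base_url=ws_url,
            api_key=api_key
        )

        audio_chunks = []
        transcript_parts = []

        async with client.realtime.connect(model=deployment) as conn:
            # Configure for audio-only output
            await conn.session.update(session={
                "output_modalities": ["audio"],
                "instructions": "You are a narrator. Speak naturally."
            })

            # Send text to narrate
            await conn.conversation.item.create(item={
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": prompt}]
            })

            await conn.response.create()

            # Collect streaming events until the response completes
            async for event in conn:
                if event.type == "response.output_audio.delta":
                    audio_chunks.append(base64.b64decode(event.delta))
                elif event.type == "response.output_audio_transcript.delta":
                    transcript_parts.append(event.delta)
                elif event.type == "error":
                    raise RuntimeError(event.error.message)
                elif event.type == "response.done":
                    break

        # Convert PCM to WAV (see scripts/pcm_to_wav.py)
        pcm_audio = b"".join(audio_chunks)
        wav_audio = pcm_to_wav(pcm_audio, sample_rate=24000)
        return wav_audio, "".join(transcript_parts)
    ```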
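
The `pcm_to_wav` helper lives in `scripts/pcm_to_wav.py`, which is not reproduced here. As a rough sketch of what such a conversion involves, assuming 16-bit mono PCM as listed under Audio Format, the standard-library `wave` module can write the WAV header. The function below is an illustration and may differ from the script's actual implementation; the `channels` and `sample_width` parameters are assumptions.

```python
import io
import wave


def pcm_to_wav(pcm_data: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)      # 1 = mono
        wav_file.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)   # 24000 Hz for Realtime output
        wav_file.writeframes(pcm_data)
    # wave does not close a file object it did not open, so the buffer is still readable
    return buffer.getvalue()
```

The conversion only prepends a header; the samples themselves are copied unchanged, so no re-encoding or quality loss is involved.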
    
    ### Frontend Audio Playback
    
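    ```javascript
    // Decode a base64 string into a Blob of the given MIME type
    const base64ToBlob = (base64, mimeType) => {
      const bytes = atob(base64);
      const arr = new Uint8Array(bytes.length);
      for (let i = 0; i < bytes.length; i++) arr[i] = bytes.charCodeAt(i);
      return new Blob([arr], { type: mimeType });
    };
    
    // `response` is the parsed JSON returned by the backend
    const audioBlob = base64ToBlob(response.audio_data, 'audio/wav');
    const audioUrl = URL.createObjectURL(audioBlob);
    const audio = new Audio(audioUrl);
    // Release the object URL once playback finishes
    audio.addEventListener('ended', () => URL.revokeObjectURL(audioUrl));
    // play() returns a promise that rejects if the browser blocks autoplay
    audio.play().catch((err) => console.error('Playback failed:', err));
    ```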
    
    ## Voice Options
    
    | Voice | Character |
    |-------|-----------|
    | alloy | Neutral |
    | echo | Warm |
    | fable | Expressive |
    | onyx | Deep |
    | nova | Friendly |
    | shimmer | Clear |
    
    ## Realtime API Events
    
    - `response.output_audio.delta` - Base64-encoded PCM audio chunk
    - `response.output_audio_transcript.delta` - Transcript text fragment
    - `response.done` - Generation complete; stop reading events
    - `error` - Request failed; read the message from `event.error.message`
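
The event list above maps onto a small dispatcher that can be unit-tested without a live connection. This is a sketch under stated assumptions: `NarrationBuffer` and `handle_event` are hypothetical names, and events are only assumed to expose `type`, `delta`, and `error.message` as used elsewhere in this document.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class NarrationBuffer:
    """Accumulates decoded audio and transcript text for one response."""
    audio_chunks: list = field(default_factory=list)
    transcript_parts: list = field(default_factory=list)


def handle_event(event, buffer: NarrationBuffer) -> bool:
    """Apply one event to the buffer; return True once generation is complete."""
    if event.type == "response.output_audio.delta":
        # Audio deltas arrive base64-encoded; store the raw PCM bytes
        buffer.audio_chunks.append(base64.b64decode(event.delta))
    elif event.type == "response.output_audio_transcript.delta":
        buffer.transcript_parts.append(event.delta)
    elif event.type == "error":
        raise RuntimeError(event.error.message)
    elif event.type == "response.done":
        return True
    # Any other event type is ignored
    return False
```

Inside the streaming loop this could be used as `if handle_event(event, buffer): break`, which keeps the per-event logic separate from the connection handling.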
    
    ## Audio Format
    
    - **Input**: Text prompt
    - **Output**: PCM audio (24 kHz, 16-bit, mono)
    - **Storage**: Base64-encoded WAV
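
Because the output is 24 kHz, 16-bit, mono PCM, each second of audio occupies `24000 * 2 * 1 = 48000` bytes, so a clip's duration follows directly from its byte length. The sketch below works through that arithmetic and the base64 step used for storage; `pcm_duration_seconds` and `encode_for_storage` are hypothetical helper names introduced for illustration.

```python
import base64

SAMPLE_RATE = 24000    # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 1           # mono


def pcm_duration_seconds(pcm_data: bytes) -> float:
    """Duration of raw PCM audio in the format listed above."""
    return len(pcm_data) / (SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)


def encode_for_storage(wav_data: bytes) -> str:
    """Base64-encode WAV bytes as ASCII text for JSON responses or storage."""
    return base64.b64encode(wav_data).decode("ascii")
```

A 96,000-byte PCM buffer therefore lasts 2.0 seconds. Base64 emits 4 characters for every 3 input bytes, so the encoded payload is about one third larger than the WAV file, which is worth keeping in mind for long narrations.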
    
    ## References
    
    - **Full architecture**: See [references/architecture.md](references/architecture.md) for complete stack design
    - **Code examples**: See [references/code-examples.md](references/code-examples.md) for production patterns
    - **PCM conversion**: Use [scripts/pcm_to_wav.py](scripts/pcm_to_wav.py) for audio format conversion