Back to Skills
    🦞

    azure-ai-voicelive-py

    Build real-time voice AI applications

    By @thegovind
    View on GitHub
    SKILL.md
    ---
    name: azure-ai-voicelive-py
    description: Build real-time voice AI applications using Azure AI Voice Live SDK (azure-ai-voicelive). Use this skill when creating Python applications that need real-time bidirectional audio communication with Azure AI, including voice assistants, voice-enabled chatbots, real-time speech-to-speech translation, voice-driven avatars, or any WebSocket-based audio streaming with AI models. Supports Server VAD (Voice Activity Detection), turn-based conversation, function calling, MCP tools, avatar integration, and transcription.
    package: azure-ai-voicelive
    ---
    
    # Azure AI Voice Live SDK
    
    Build real-time voice AI applications with bidirectional WebSocket communication.
    
    ## Installation
    
    ```bash
    pip install azure-ai-voicelive aiohttp azure-identity
    ```
    
    ## Environment Variables
    
    ```bash
    AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
    # For API key auth (not recommended for production)
    AZURE_COGNITIVE_SERVICES_KEY=<api-key>
    ```
    
    ## Authentication
    
    **DefaultAzureCredential (preferred)**:
    ```python
    from azure.ai.voicelive.aio import connect
    from azure.identity.aio import DefaultAzureCredential
    
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        ...
    ```
    
    **API Key**:
    ```python
    from azure.ai.voicelive.aio import connect
    from azure.core.credentials import AzureKeyCredential
    
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
        model="gpt-4o-realtime-preview"
    ) as conn:
        ...
    ```
    
    ## Quick Start
    
    ```python
    import asyncio
    import os
    from azure.ai.voicelive.aio import connect
    from azure.identity.aio import DefaultAzureCredential
    
    async def main():
        async with connect(
            endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
            credential=DefaultAzureCredential(),
            model="gpt-4o-realtime-preview",
            credential_scopes=["https://cognitiveservices.azure.com/.default"]
        ) as conn:
            # Update session with instructions
            await conn.session.update(session={
                "instructions": "You are a helpful assistant.",
                "modalities": ["text", "audio"],
                "voice": "alloy"
            })
            
            # Listen for events
            async for event in conn:
                print(f"Event: {event.type}")
                if event.type == "response.audio_transcript.done":
                    print(f"Transcript: {event.transcript}")
                elif event.type == "response.done":
                    break
    
    asyncio.run(main())
    ```
    
    ## Core Architecture
    
    ### Connection Resources
    
    The `VoiceLiveConnection` exposes these resources:
    
    | Resource | Purpose | Key Methods |
    |----------|---------|-------------|
    | `conn.session` | Session configuration | `update(session=...)` |
    | `conn.response` | Model responses | `create()`, `cancel()` |
    | `conn.input_audio_buffer` | Audio input | `append()`, `commit()`, `clear()` |
    | `conn.output_audio_buffer` | Audio output | `clear()` |
    | `conn.conversation` | Conversation state | `item.create()`, `item.delete()`, `item.truncate()` |
    | `conn.transcription_session` | Transcription config | `update(session=...)` |
    
    ## Session Configuration
    
    ```python
    from azure.ai.voicelive.models import RequestSession, FunctionTool
    
    await conn.session.update(session=RequestSession(
        instructions="You are a helpful voice assistant.",
        modalities=["text", "audio"],
        voice="alloy",  # or "echo", "shimmer", "sage", etc.
        input_audio_format="pcm16",
        output_audio_format="pcm16",
        turn_detection={
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500
        },
        tools=[
            FunctionTool(
                type="function",
                name="get_weather",
                description="Get current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            )
        ]
    ))
    ```
    
    ## Audio Streaming
    
    ### Send Audio (Base64 PCM16)
    
    ```python
    import base64
    
    # Read audio chunk (16-bit PCM, 24kHz mono)
    audio_chunk = await read_audio_from_microphone()
    b64_audio = base64.b64encode(audio_chunk).decode()
    
    await conn.input_audio_buffer.append(audio=b64_audio)
    ```
    
    ### Receive Audio
    
    ```python
    async for event in conn:
        if event.type == "response.audio.delta":
            audio_bytes = base64.b64decode(event.delta)
            await play_audio(audio_bytes)
        elif event.type == "response.audio.done":
            print("Audio complete")
    ```
    
    ## Event Handling
    
    ```python
    async for event in conn:
        match event.type:
            # Session events
            case "session.created":
                print(f"Session: {event.session}")
            case "session.updated":
                print("Session updated")
            
            # Audio input events
            case "input_audio_buffer.speech_started":
                print(f"Speech started at {event.audio_start_ms}ms")
            case "input_audio_buffer.speech_stopped":
                print(f"Speech stopped at {event.audio_end_ms}ms")
            
            # Transcription events
            case "conversation.item.input_audio_transcription.completed":
                print(f"User said: {event.transcript}")
            case "conversation.item.input_audio_transcription.delta":
                print(f"Partial: {event.delta}")
            
            # Response events
            case "response.created":
                print(f"Response started: {event.response.id}")
            case "response.audio_transcript.delta":
                print(event.delta, end="", flush=True)
            case "response.audio.delta":
                audio = base64.b64decode(event.delta)
            case "response.done":
                print(f"Response complete: {event.response.status}")
            
            # Function calls
            case "response.function_call_arguments.done":
                result = handle_function(event.name, event.arguments)
                await conn.conversation.item.create(item={
                    "type": "function_call_output",
                    "call_id": event.call_id,
                    "output": json.dumps(result)
                })
                await conn.response.create()
            
            # Errors
            case "error":
                print(f"Error: {event.error.message}")
    ```
    
    ## Common Patterns
    
    ### Manual Turn Mode (No VAD)
    
    ```python
    await conn.session.update(session={"turn_detection": None})
    
    # Manually control turns
    await conn.input_audio_buffer.append(audio=b64_audio)
    await conn.input_audio_buffer.commit()  # End of user turn
    await conn.response.create()  # Trigger response
    ```
    
    ### Interrupt Handling
    
    ```python
    async for event in conn:
        if event.type == "input_audio_buffer.speech_started":
            # User interrupted - cancel current response
            await conn.response.cancel()
            await conn.output_audio_buffer.clear()
    ```
    
    ### Conversation History
    
    ```python
    # Add system message
    await conn.conversation.item.create(item={
        "type": "message",
        "role": "system",
        "content": [{"type": "input_text", "text": "Be concise."}]
    })
    
    # Add user message
    await conn.conversation.item.create(item={
        "type": "message",
        "role": "user", 
        "content": [{"type": "input_text", "text": "Hello!"}]
    })
    
    await conn.response.create()
    ```
    
    ## Voice Options
    
    | Voice | Description |
    |-------|-------------|
    | `alloy` | Neutral, balanced |
    | `echo` | Warm, conversational |
    | `shimmer` | Clear, professional |
    | `sage` | Calm, authoritative |
    | `coral` | Friendly, upbeat |
    | `ash` | Deep, measured |
    | `ballad` | Expressive |
    | `verse` | Storytelling |
    
    Azure voices: Use `AzureStandardVoice`, `AzureCustomVoice`, or `AzurePersonalVoice` models.
    
    ## Audio Formats
    
    | Format | Sample Rate | Use Case |
    |--------|-------------|----------|
    | `pcm16` | 24kHz | Default, high quality |
    | `pcm16-8000hz` | 8kHz | Telephony |
    | `pcm16-16000hz` | 16kHz | Voice assistants |
    | `g711_ulaw` | 8kHz | Telephony (US) |
    | `g711_alaw` | 8kHz | Telephony (EU) |
    
    ## Turn Detection Options
    
    ```python
    # Server VAD (default)
    {"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}
    
    # Azure Semantic VAD (smarter detection)
    {"type": "azure_semantic_vad"}
    {"type": "azure_semantic_vad_en"}  # English optimized
    {"type": "azure_semantic_vad_multilingual"}
    ```
    
    ## Error Handling
    
    ```python
    from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed
    
    try:
        async with connect(...) as conn:
            async for event in conn:
                if event.type == "error":
                    print(f"API Error: {event.error.code} - {event.error.message}")
    except ConnectionClosed as e:
        print(f"Connection closed: {e.code} - {e.reason}")
    except ConnectionError as e:
        print(f"Connection error: {e}")
    ```
    
    ## References
    
    - **Detailed API Reference**: See [references/api-reference.md](references/api-reference.md)
    - **Complete Examples**: See [references/examples.md](references/examples.md)
    - **All Models & Types**: See [references/models.md](references/models.md)