Back to Skills
    🦞

    azure-ai-evaluation-py

    Azure AI Evaluation SDK for Python.

    By @thegovind
    View on GitHub
    SKILL.md
    ---
    name: azure-ai-evaluation-py
    description: |
      Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, and custom evaluators.
      Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics".
    package: azure-ai-evaluation
    ---
    
    # Azure AI Evaluation SDK for Python
    
    Assess generative AI application performance with built-in and custom evaluators.
    
    ## Installation
    
    ```bash
    pip install azure-ai-evaluation
    
    # With remote evaluation support
    pip install azure-ai-evaluation[remote]
    ```
    
    ## Environment Variables
    
    ```bash
    # For AI-assisted evaluators
    AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
    AZURE_OPENAI_API_KEY=<your-api-key>
    AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
    
    # For Foundry project integration
    AIPROJECT_CONNECTION_STRING=<your-connection-string>
    ```
    
    ## Built-in Evaluators
    
    ### Quality Evaluators (AI-Assisted)
    
    ```python
    from azure.ai.evaluation import (
        GroundednessEvaluator,
        RelevanceEvaluator,
        CoherenceEvaluator,
        FluencyEvaluator,
        SimilarityEvaluator,
        RetrievalEvaluator
    )
    
    # Initialize with Azure OpenAI model config
    model_config = {
        "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
        "api_key": os.environ["AZURE_OPENAI_API_KEY"],
        "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
    }
    
    groundedness = GroundednessEvaluator(model_config)
    relevance = RelevanceEvaluator(model_config)
    coherence = CoherenceEvaluator(model_config)
    ```
    
    ### Quality Evaluators (NLP-based)
    
    ```python
    from azure.ai.evaluation import (
        F1ScoreEvaluator,
        RougeScoreEvaluator,
        BleuScoreEvaluator,
        GleuScoreEvaluator,
        MeteorScoreEvaluator
    )
    
    f1 = F1ScoreEvaluator()
    rouge = RougeScoreEvaluator()
    bleu = BleuScoreEvaluator()
    ```
    
    ### Safety Evaluators
    
    ```python
    from azure.ai.evaluation import (
        ViolenceEvaluator,
        SexualEvaluator,
        SelfHarmEvaluator,
        HateUnfairnessEvaluator,
        IndirectAttackEvaluator,
        ProtectedMaterialEvaluator
    )
    
    violence = ViolenceEvaluator(azure_ai_project=project_scope)
    sexual = SexualEvaluator(azure_ai_project=project_scope)
    ```
    
    ## Single Row Evaluation
    
    ```python
    from azure.ai.evaluation import GroundednessEvaluator
    
    groundedness = GroundednessEvaluator(model_config)
    
    result = groundedness(
        query="What is Azure AI?",
        context="Azure AI is Microsoft's AI platform...",
        response="Azure AI provides AI services and tools."
    )
    
    print(f"Groundedness score: {result['groundedness']}")
    print(f"Reason: {result['groundedness_reason']}")
    ```
    
    ## Batch Evaluation with evaluate()
    
    ```python
    from azure.ai.evaluation import evaluate
    
    result = evaluate(
        data="test_data.jsonl",
        evaluators={
            "groundedness": groundedness,
            "relevance": relevance,
            "coherence": coherence
        },
        evaluator_config={
            "default": {
                "column_mapping": {
                    "query": "${data.query}",
                    "context": "${data.context}",
                    "response": "${data.response}"
                }
            }
        }
    )
    
    print(result["metrics"])
    ```
    
    ## Composite Evaluators
    
    ```python
    from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator
    
    # All quality metrics in one
    qa_evaluator = QAEvaluator(model_config)
    
    # All safety metrics in one
    safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope)
    
    result = evaluate(
        data="data.jsonl",
        evaluators={
            "qa": qa_evaluator,
            "content_safety": safety_evaluator
        }
    )
    ```
    
    ## Evaluate Application Target
    
    ```python
    from azure.ai.evaluation import evaluate
    from my_app import chat_app  # Your application
    
    result = evaluate(
        data="queries.jsonl",
        target=chat_app,  # Callable that takes query, returns response
        evaluators={
            "groundedness": groundedness
        },
        evaluator_config={
            "default": {
                "column_mapping": {
                    "query": "${data.query}",
                    "context": "${outputs.context}",
                    "response": "${outputs.response}"
                }
            }
        }
    )
    ```
    
    ## Custom Evaluators
    
    ### Code-Based
    
    ```python
    from azure.ai.evaluation import evaluator
    
    @evaluator
    def word_count_evaluator(response: str) -> dict:
        return {"word_count": len(response.split())}
    
    # Use in evaluate()
    result = evaluate(
        data="data.jsonl",
        evaluators={"word_count": word_count_evaluator}
    )
    ```
    
    ### Prompt-Based
    
    ```python
    from azure.ai.evaluation import PromptChatTarget
    
    class CustomEvaluator:
        def __init__(self, model_config):
            self.model = PromptChatTarget(model_config)
        
        def __call__(self, query: str, response: str) -> dict:
            prompt = f"Rate this response 1-5: Query: {query}, Response: {response}"
            result = self.model.send_prompt(prompt)
            return {"custom_score": int(result)}
    ```
    
    ## Log to Foundry Project
    
    ```python
    from azure.ai.projects import AIProjectClient
    from azure.identity import DefaultAzureCredential
    
    project = AIProjectClient.from_connection_string(
        conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
        credential=DefaultAzureCredential()
    )
    
    result = evaluate(
        data="data.jsonl",
        evaluators={"groundedness": groundedness},
        azure_ai_project=project.scope  # Logs results to Foundry
    )
    
    print(f"View results: {result['studio_url']}")
    ```
    
    ## Evaluator Reference
    
    | Evaluator | Type | Metrics |
    |-----------|------|---------|
    | `GroundednessEvaluator` | AI | groundedness (1-5) |
    | `RelevanceEvaluator` | AI | relevance (1-5) |
    | `CoherenceEvaluator` | AI | coherence (1-5) |
    | `FluencyEvaluator` | AI | fluency (1-5) |
    | `SimilarityEvaluator` | AI | similarity (1-5) |
    | `RetrievalEvaluator` | AI | retrieval (1-5) |
    | `F1ScoreEvaluator` | NLP | f1_score (0-1) |
    | `RougeScoreEvaluator` | NLP | rouge scores |
    | `ViolenceEvaluator` | Safety | violence (0-7) |
    | `SexualEvaluator` | Safety | sexual (0-7) |
    | `SelfHarmEvaluator` | Safety | self_harm (0-7) |
    | `HateUnfairnessEvaluator` | Safety | hate_unfairness (0-7) |
    | `QAEvaluator` | Composite | All quality metrics |
    | `ContentSafetyEvaluator` | Composite | All safety metrics |
    
    ## Best Practices
    
    1. **Use composite evaluators** for comprehensive assessment
    2. **Map columns correctly** — mismatched columns cause silent failures
    3. **Log to Foundry** for tracking and comparison across runs
    4. **Create custom evaluators** for domain-specific metrics
    5. **Use NLP evaluators** when you have ground truth answers
    6. **Safety evaluators require** Azure AI project scope
    7. **Batch evaluation** is more efficient than single-row loops
    
    ## Reference Files
    
    | File | Contents |
    |------|----------|
    | [references/built-in-evaluators.md](references/built-in-evaluators.md) | Detailed patterns for AI-assisted, NLP-based, and Safety evaluators with configuration tables |
    | [references/custom-evaluators.md](references/custom-evaluators.md) | Creating code-based and prompt-based custom evaluators, testing patterns |
    | [scripts/run_batch_evaluation.py](scripts/run_batch_evaluation.py) | CLI tool for running batch evaluations with quality, safety, and custom evaluators |