Prerequisites

Before using evaluators, configure your OpenAI API key in the settings. LLM-based evaluation cannot run without this key.

LLM as a Judge

What is an LLM Evaluator?

An LLM evaluator uses a large language model to assess your agent’s performance by analyzing conversation transcripts. The evaluator:

  • Reviews the entire conversation
  • Evaluates against specified criteria
  • Provides a pass/fail result
  • Explains the reasoning behind its decision
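As a rough illustration of the steps above, the sketch below assembles a judge prompt from a transcript and a list of criteria. It is not the product's actual implementation: the function name build_judge_prompt, the prompt wording, and the requested reply format are all assumptions, and the call to the model itself is left out.

```python
def build_judge_prompt(transcript: str, criteria: list[str]) -> str:
    """Assemble a judge prompt (hypothetical format, for illustration only)."""
    # Number the criteria so the model can refer to them in its explanation.
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return (
        "You are evaluating a conversation between an agent and a caller.\n"
        "Review the entire transcript and check every criterion.\n\n"
        f"Criteria:\n{numbered}\n\n"
        f"Transcript:\n{transcript}\n\n"
        'Reply with JSON only: {"status": "pass" or "fail", "reason": "<explanation>"}'
    )


prompt = build_judge_prompt(
    "Agent: Your appointment is on May 3 at 10 am.\nCaller: Thanks.",
    [
        "The agent should confirm the appointment date and time",
        "The agent must verify the caller's name",
    ],
)
print(prompt)
```

Asking for a fixed reply shape keeps the pass/fail result and the reasoning easy to separate afterwards.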

Creating an Evaluator

Define Success Criteria

Clearly specify what constitutes a successful interaction. For example:

  • “The agent should confirm the appointment date and time”
  • “The agent must verify the caller’s name”
  • “The agent should handle interruptions politely”
  • “The agent must not share sensitive information”

Example Evaluation Criteria

{
    "name": "Appointment Confirmation Check",
    "criteria": [
        "Agent must confirm the appointment date",
        "Agent must verify patient identity",
        "Agent should maintain professional tone",
        "Agent must handle any scheduling conflicts appropriately"
    ]
}
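Before saving a definition like the one above, it can help to check its shape. The following is a minimal sketch that assumes only what the example shows: a "name" string and a "criteria" list of strings. The helper validate_evaluator is hypothetical and not part of any product API.

```python
import json

RAW = """
{
    "name": "Appointment Confirmation Check",
    "criteria": [
        "Agent must confirm the appointment date",
        "Agent must verify patient identity"
    ]
}
"""


def validate_evaluator(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks usable."""
    problems = []
    name = config.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("'name' must be a non-empty string")
    criteria = config.get("criteria")
    if not isinstance(criteria, list) or not criteria:
        problems.append("'criteria' must be a non-empty list")
    else:
        for i, item in enumerate(criteria):
            if not isinstance(item, str) or not item.strip():
                problems.append(f"criteria[{i}] must be a non-empty string")
    return problems


config = json.loads(RAW)
problems = validate_evaluator(config)                   # well-formed definition
bad = validate_evaluator({"name": "", "criteria": []})  # empty name and empty list
```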

Best Practices

Writing Effective Criteria

  1. Be Specific

    • Use clear, measurable objectives
    • Avoid ambiguous language
    • State exactly what the agent must say or do
  2. Focus on Key Behaviors

    • Identify critical success factors
    • Prioritize important interactions
    • Define must-have elements
  3. Consider Edge Cases

    • Include criteria for handling interruptions
    • Address potential misunderstandings
    • Cover error scenarios
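One way to act on the "Be Specific" advice is to scan criteria for words that leave the judgment open. The sketch below is a simple heuristic, not a feature of the evaluator: the word list and the helper find_vague_words are assumptions to adapt to your own domain.

```python
import re

# Hypothetical list of words that tend to make a criterion hard to judge.
VAGUE_WORDS = {"appropriately", "properly", "good", "well", "nicely"}


def find_vague_words(criterion: str) -> list[str]:
    """Return the vague words found in a criterion, in order of appearance."""
    words = re.findall(r"[a-z']+", criterion.lower())
    return [w for w in words if w in VAGUE_WORDS]


flagged = find_vague_words("Agent should respond appropriately to problems")
clean = find_vague_words("Agent must confirm the appointment date")
```

A flagged criterion is a prompt to rewrite it in measurable terms; a match does not by itself mean the criterion is wrong.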

Example Scenarios

Basic Appointment Confirmation

{
    "name": "Basic Confirmation",
    "criteria": [
        "Agent must verify the appointment date and time",
        "Agent must confirm the patient's name",
        "Agent must end the call professionally"
    ]
}

Complex Medical Scheduling

{
    "name": "Medical Scheduling",
    "criteria": [
        "Agent must verify patient identity before sharing appointment details",
        "Agent must confirm the appointment type and duration",
        "Agent must check insurance information",
        "Agent must handle scheduling conflicts",
        "Agent must provide preparation instructions"
    ]
}

Understanding Results

Evaluation Output

The LLM evaluator provides:

  • A pass/fail status
  • A reason explaining the decision

Example Output

{
    "status": "fail",
    "reason": "The agent failed to verify the patient's identity before sharing appointment details"
}
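Output in this shape is straightforward to consume in code. The sketch below parses it into a small record; the EvaluationResult class and parse_result function are hypothetical, and "pass" is assumed to be the counterpart of the "fail" value shown above.

```python
import json
from dataclasses import dataclass

RAW = """
{
    "status": "fail",
    "reason": "The agent failed to verify the patient's identity before sharing appointment details"
}
"""


@dataclass
class EvaluationResult:
    passed: bool
    reason: str


def parse_result(raw: str) -> EvaluationResult:
    """Parse the evaluator's JSON output; reject any status other than pass/fail."""
    data = json.loads(raw)
    status = data.get("status")
    if status not in ("pass", "fail"):  # "pass" is an assumed value
        raise ValueError(f"unexpected status: {status!r}")
    return EvaluationResult(passed=(status == "pass"), reason=str(data.get("reason", "")))


result = parse_result(RAW)
if not result.passed:
    print(f"Evaluation failed: {result.reason}")
```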

Tips for Success

  1. Iterate on Criteria

    • Start with basic requirements
    • Test with different scenarios
    • Refine based on results
  2. Balance Strictness

    • Set reasonable expectations
    • Account for natural conversation flow
    • Consider multiple valid approaches
  3. Review and Adjust

    • Monitor evaluation results
    • Identify patterns in failures
    • Update criteria as needed
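Spotting patterns in failures is easier with a quick tally. The sketch below uses made-up results purely for illustration; each record mirrors the status/reason output described earlier, plus a hypothetical "evaluator" label that is assumed here.

```python
from collections import Counter

# Made-up results for illustration only.
results = [
    {"evaluator": "Basic Confirmation", "status": "pass", "reason": "All criteria met"},
    {"evaluator": "Basic Confirmation", "status": "fail", "reason": "Patient name was not confirmed"},
    {"evaluator": "Medical Scheduling", "status": "fail", "reason": "Insurance information was not checked"},
    {"evaluator": "Medical Scheduling", "status": "fail", "reason": "Insurance information was not checked"},
]

failures = [r for r in results if r["status"] == "fail"]

# Which evaluators fail most often, and for which reason?
failures_by_evaluator = Counter(r["evaluator"] for r in failures)
most_common_reason = Counter(r["reason"] for r in failures).most_common(1)
pass_rate = (len(results) - len(failures)) / len(results)

print(failures_by_evaluator, most_common_reason, pass_rate)
```

A reason that recurs across many calls usually points either to a real gap in the agent or to a criterion that is stricter than intended, which is the cue to update it.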