Prerequisites

Before using evaluators, you must configure your OpenAI API key in the settings. This key is required for the LLM-based evaluation functionality.

LLM as a Judge

What is an LLM Evaluator?

An LLM evaluator uses a large language model to assess your agent’s performance by analyzing conversation transcripts. The evaluator:

  • Reviews the entire conversation
  • Evaluates against specified criteria
  • Provides a pass/fail result
  • Explains the reasoning behind its decision

Creating an Evaluator

Define Success Criteria

Clearly specify what constitutes a successful interaction. For example:

  • “The agent should confirm the appointment date and time”
  • “The agent must verify the caller’s name”
  • “The agent should handle interruptions politely”
  • “The agent must not share sensitive information”

Example Evaluation Criteria

{
    "name": "Appointment Confirmation Check",
    "criteria": [
        "Agent must confirm the appointment date",
        "Agent must verify patient identity",
        "Agent should maintain professional tone",
        "Agent must handle any scheduling conflicts appropriately"
    ]
}

Best Practices

Writing Effective Criteria

  1. Be Specific

    • Use clear, measurable objectives
    • Avoid ambiguous language (see the vague vs. specific comparison after this list)
    • Include specific requirements
  2. Focus on Key Behaviors

    • Identify critical success factors
    • Prioritize important interactions
    • Define must-have elements
  3. Consider Edge Cases

    • Include criteria for handling interruptions
    • Address potential misunderstandings
    • Cover error scenarios
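
For instance, the vague criterion “Agent should be helpful” is hard to judge consistently. A specific, measurable version of the same intent might look like this (the wording is illustrative, not a required format):

{
    "name": "Identity Verification Check",
    "criteria": [
        "Agent must ask for the caller's full name and date of birth before sharing any appointment details",
        "Agent must offer an alternative time if the requested slot is unavailable"
    ]
}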

Example Scenarios

Basic Appointment Confirmation

{
    "name": "Basic Confirmation",
    "criteria": [
        "Verify appointment date and time",
        "Confirm patient name",
        "End call professionally"
    ]
}

Complex Medical Scheduling

{
    "name": "Medical Scheduling",
    "criteria": [
        "Verify patient identity securely",
        "Confirm appointment type and duration",
        "Check insurance information",
        "Handle scheduling conflicts",
        "Provide preparation instructions"
    ]
}

Understanding Results

Evaluation Output

The LLM evaluator provides:

  • A pass/fail status
  • A reason explaining the decision

Example Output

{
    "status": "fail",
    "reason": "The agent failed to verify the patient's identity before sharing appointment details"
}
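
If you consume these results programmatically, the two fields above map to a small type. The following is a sketch based only on the example output shown here (status and reason); it is not an official SDK type:

interface LlmEvaluationResult {
  // Overall result, as in the example above
  status: "pass" | "fail";
  // Explanation of why the evaluator reached its decision
  reason: string;
}

// Example: treat anything other than an explicit pass as a failure
function isPassing(result: LlmEvaluationResult): boolean {
  return result.status === "pass";
}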

Tips for Success

  1. Iterate on Criteria

    • Start with basic requirements
    • Test with different scenarios
    • Refine based on results
  2. Balance Strictness

    • Set reasonable expectations
    • Account for natural conversation flow
    • Consider multiple valid approaches
  3. Review and Adjust

    • Monitor evaluation results
    • Identify patterns in failures
    • Update criteria as needed

Webhook Evaluators

What is a Webhook Evaluator?

A webhook evaluator lets you implement custom evaluation logic by hosting your own evaluation endpoint. This gives you complete control over the evaluation process and supports complex, domain-specific evaluation criteria.

Webhook Payload Structure

The webhook will receive a JSON payload containing:

  • Input details about the scenario, persona, and data fields
  • Output containing the conversation transcript

Example payload:

{
  "input": {
    "scenarioName": "Patient Appointment Confirmation",
    "scenarioDescription": "Test how well the agent confirms patient identity and appointment details",
    "personaName": "Impatient Caller",
    "personaDescription": "In a hurry, frequently interrupts, and expresses urgency throughout the call.",
    "edgeCases": [],
    "dataFields": [
      {
        "name": "firstName",
        "description": "Patient's first name",
        "example": "John"
      },
      {
        "name": "lastName",
        "description": "Patient's last name",
        "example": "Doe"
      },
      {
        "name": "dateOfBirth",
        "description": "Patient's date of birth",
        "example": "1990-01-01"
      },
      {
        "name": "appointmentTime",
        "description": "Scheduled appointment time",
        "example": "2024-03-27T14:30:00Z"
      }
    ]
  },
  "output": {
    "messages": [
      {
        "id": "item_BPE8uM8EiTE44GBo6Eewd",
        "timestamp": "2025-04-22T20:04:25.051Z",
        "role": "user",
        "roleLabel": "Your Agent",
        "content": "Hello, how are you?"
      },
      // ... conversation messages ...
    ]
  }
}
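
When parsing this payload in TypeScript, the fields shown above can be modeled with a few interfaces. This is a sketch derived from the example payload; confirm field names and optionality against the payloads your endpoint actually receives:

interface DataField {
  name: string;
  description: string;
  example: string;
}

interface EvaluatorInput {
  scenarioName: string;
  scenarioDescription: string;
  personaName: string;
  personaDescription: string;
  // Empty in the example above; assumed to be a list of strings
  edgeCases: string[];
  dataFields: DataField[];
}

interface TranscriptMessage {
  id: string;
  timestamp: string;
  role: string;
  roleLabel: string;
  content: string;
}

interface WebhookEvaluatorPayload {
  input: EvaluatorInput;
  output: {
    messages: TranscriptMessage[];
  };
}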

Implementing a Webhook Evaluator

You can host your webhook evaluator using services like Val Town, which provides a simple way to deploy and run JavaScript functions as webhooks.

Example implementation using Val Town:

// Example webhook evaluator hosted on Val Town
// https://www.val.town/x/autoblocks/Autoblocks_Webhook_Evaluator

import { Evaluation } from "npm:@autoblocks/client/testing";

export default async function httpHandler(request: Request): Promise<Response> {
  if (request.method !== "POST") {
    return Response.json({ message: "Invalid method." }, {
      status: 400,
    });
  }
  try {
    const body = await request.json();

    // Analyze the messages in the body
    // Create an evaluation object

    const response: Evaluation = {
      // Add your score between 0 and 1
      score: 1,
      // Use threshold to determine whether the evaluation passed or failed.
      threshold: {
        gte: 1,
      },
      metadata: {
        reason: "Add your reason here.",
      },
    };
    return Response.json(response);
  } catch (e) {
    return Response.json({ message: "Could not evaluate." }, {
      status: 500,
    });
  }
}
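
To sanity-check the handler before deploying, you can invoke it directly with a constructed Request. This sketch assumes a runtime with the standard Request/Response classes (such as Deno or Node 18+); the import path for the handler above is illustrative:

import httpHandler from "./webhook-evaluator.ts"; // illustrative path to the handler above

// Minimal payload following the structure shown earlier
const samplePayload = {
  input: { scenarioName: "Patient Appointment Confirmation", dataFields: [] },
  output: {
    messages: [
      { role: "user", roleLabel: "Your Agent", content: "Hello, how are you?" },
    ],
  },
};

const request = new Request("https://example.com/evaluate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(samplePayload),
});

const response = await httpHandler(request);
console.log(response.status, await response.json());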

Using SDKs for Types

You can use our official SDKs to ensure correct types for the return value:

  • JavaScript SDK - Use the Evaluation class in the @autoblocks/client/testing package
  • Python SDK - Use the Evaluation dataclass in the autoblocks.testing.models package

Example using the JavaScript SDK:

import type { Evaluation } from "@autoblocks/client/testing";

// implement your evaluation logic here

const evaluation: Evaluation = {
  score: 1,
  threshold: {
    gte: 1,
  },
};

Best Practices

  1. Error Handling

    • Implement proper error handling
    • Return meaningful error messages
    • Log evaluation failures
  2. Performance

    • Keep evaluation logic efficient
    • Handle timeouts appropriately
    • Cache expensive computations
  3. Testing

    • Test with various scenarios
    • Verify edge cases
    • Monitor evaluation consistency
  4. Security

    • Secure your endpoint with authentication headers (see the sketch after this list)
    • Validate incoming requests
    • Use environment variables for sensitive data
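
As a minimal sketch of the authentication point above, you might require a shared-secret header at the top of the handler. The header name and environment variable below are illustrative, not part of any required contract (Deno.env is available on Val Town):

// Returns true only when the request carries the expected shared secret
function isAuthorized(request: Request): boolean {
  const expected = Deno.env.get("EVALUATOR_WEBHOOK_SECRET"); // illustrative variable name
  const provided = request.headers.get("x-webhook-secret"); // illustrative header name
  return Boolean(expected) && provided === expected;
}

// Usage inside the handler, before parsing the body:
// if (!isAuthorized(request)) {
//   return Response.json({ message: "Unauthorized." }, { status: 401 });
// }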