Prerequisites

Before using evaluators, you must configure your OpenAI API key in the settings. This key is required for the LLM-based evaluation functionality.

LLM as a Judge

What is an LLM Evaluator?

An LLM evaluator uses a large language model to assess your agent’s performance by analyzing conversation transcripts. The evaluator:

  • Reviews the entire conversation
  • Evaluates against specified criteria
  • Provides a pass/fail result
  • Explains the reasoning behind its decision

Creating an Evaluator

Define Success Criteria

Clearly specify what constitutes a successful interaction. For example:

  • “The agent should confirm the appointment date and time”
  • “The agent must verify the caller’s name”
  • “The agent should handle interruptions politely”
  • “The agent must not share sensitive information”

Example Evaluation Criteria

{
    "name": "Appointment Confirmation Check",
    "criteria": [
        "Agent must confirm the appointment date",
        "Agent must verify patient identity",
        "Agent should maintain professional tone",
        "Agent must handle any scheduling conflicts appropriately"
    ]
}

Best Practices

Writing Effective Criteria

  1. Be Specific

    • Use clear, measurable objectives
    • Avoid ambiguous language (see the vague vs. specific comparison after this list)
    • Include specific requirements
  2. Focus on Key Behaviors

    • Identify critical success factors
    • Prioritize important interactions
    • Define must-have elements
  3. Consider Edge Cases

    • Include criteria for handling interruptions
    • Address potential misunderstandings
    • Cover error scenarios
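
For instance, the vague criterion “Agent should be helpful” is hard to judge consistently. A specific, measurable version of the same intent might look like this (the wording is illustrative, not a required format):

{
    "name": "Identity Verification Check",
    "criteria": [
        "Agent must ask for the caller's full name and date of birth before sharing any appointment details",
        "Agent must offer an alternative time if the requested slot is unavailable"
    ]
}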

Example Scenarios

Basic Appointment Confirmation

{
    "name": "Basic Confirmation",
    "criteria": [
        "Verify appointment date and time",
        "Confirm patient name",
        "End call professionally"
    ]
}

Complex Medical Scheduling

{
    "name": "Medical Scheduling",
    "criteria": [
        "Verify patient identity securely",
        "Confirm appointment type and duration",
        "Check insurance information",
        "Handle scheduling conflicts",
        "Provide preparation instructions"
    ]
}

Understanding Results

Evaluation Output

The LLM evaluator provides:

  • A pass/fail status
  • A reason explaining the decision

Example Output

{
    "status": "fail",
    "reason": "The agent failed to verify the patient's identity before sharing appointment details"
}
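
If you consume these results programmatically, the two fields above map to a small type. The following is a sketch based only on the example output shown here (status and reason); it is not an official SDK type:

interface LlmEvaluationResult {
  // Overall result, as in the example above
  status: "pass" | "fail";
  // Explanation of why the evaluator reached its decision
  reason: string;
}

// Example: treat anything other than an explicit pass as a failure
function isPassing(result: LlmEvaluationResult): boolean {
  return result.status === "pass";
}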

Tips for Success

  1. Iterate on Criteria

    • Start with basic requirements
    • Test with different scenarios
    • Refine based on results
  2. Balance Strictness

    • Set reasonable expectations
    • Account for natural conversation flow
    • Consider multiple valid approaches
  3. Review and Adjust

    • Monitor evaluation results
    • Identify patterns in failures
    • Update criteria as needed

Webhook Evaluators

What is a Webhook Evaluator?

A webhook evaluator lets you implement custom evaluation logic by hosting your own evaluation endpoint. This gives you complete control over the evaluation process and supports complex, domain-specific evaluation criteria.

Webhook Payload Structure

The webhook will receive a JSON payload containing:

  • Input details about the scenario, persona, and data fields
  • Output containing the conversation transcript

Example payload:

{
  "input": {
    "scenarioName": "Patient Appointment Confirmation",
    "scenarioDescription": "Test how well the agent confirms patient identity and appointment details",
    "personaName": "Impatient Caller",
    "personaDescription": "In a hurry, frequently interrupts, and expresses urgency throughout the call.",
    "edgeCases": [],
    "dataFields": [
      {
        "name": "firstName",
        "description": "Patient's first name",
        "example": "John"
      },
      {
        "name": "lastName",
        "description": "Patient's last name",
        "example": "Doe"
      },
      {
        "name": "dateOfBirth",
        "description": "Patient's date of birth",
        "example": "1990-01-01"
      },
      {
        "name": "appointmentTime",
        "description": "Scheduled appointment time",
        "example": "2024-03-27T14:30:00Z"
      }
    ]
  },
  "output": {
    "messages": [
      {
        "id": "item_BPE8uM8EiTE44GBo6Eewd",
        "timestamp": "2025-04-22T20:04:25.051Z",
        "role": "user",
        "roleLabel": "Your Agent",
        "content": "Hello, how are you?"
      },
      // ... conversation messages ...
    ]
  }
}
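
When parsing this payload in TypeScript, the fields shown above can be modeled with a few interfaces. This is a sketch derived from the example payload; confirm field names and optionality against the payloads your endpoint actually receives:

interface DataField {
  name: string;
  description: string;
  example: string;
}

interface EvaluatorInput {
  scenarioName: string;
  scenarioDescription: string;
  personaName: string;
  personaDescription: string;
  // Empty in the example above; assumed to be a list of strings
  edgeCases: string[];
  dataFields: DataField[];
}

interface TranscriptMessage {
  id: string;
  timestamp: string;
  role: string;
  roleLabel: string;
  content: string;
}

interface WebhookEvaluatorPayload {
  input: EvaluatorInput;
  output: {
    messages: TranscriptMessage[];
  };
}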

Implementing a Webhook Evaluator

You can host your webhook evaluator using services like Val Town, which provides a simple way to deploy and run JavaScript functions as webhooks.

Example implementation using Val Town:

// Example webhook evaluator hosted on Val Town
// https://www.val.town/x/autoblocks/Autoblocks_Webhook_Evaluator

import { Evaluation } from "npm:@autoblocks/client/testing";

export default async function httpHandler(request: Request): Promise<Response> {
  if (request.method !== "POST") {
    return Response.json({ message: "Invalid method." }, {
      status: 400,
    });
  }
  try {
    const body = await request.json();

    // Analyze the messages in the body
    // Create an evaluation object

    const response: Evaluation = {
      // Add your score between 0 and 1
      score: 1,
      // Use threshold to determine whether the evaluation passed or failed.
      threshold: {
        gte: 1,
      },
      metadata: {
        reason: "Add your reason here.",
      },
    };
    return Response.json(response);
  } catch (e) {
    return Response.json({ message: "Could not evaluate." }, {
      status: 500,
    });
  }
}
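
To sanity-check the handler before deploying, you can invoke it directly with a constructed Request. This sketch assumes a runtime with the standard Request/Response classes (such as Deno or Node 18+); the import path for the handler above is illustrative:

import httpHandler from "./webhook-evaluator.ts"; // illustrative path to the handler above

// Minimal payload following the structure shown earlier
const samplePayload = {
  input: { scenarioName: "Patient Appointment Confirmation", dataFields: [] },
  output: {
    messages: [
      { role: "user", roleLabel: "Your Agent", content: "Hello, how are you?" },
    ],
  },
};

const request = new Request("https://example.com/evaluate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(samplePayload),
});

const response = await httpHandler(request);
console.log(response.status, await response.json());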

Using SDKs for Types

You can use our official SDKs to ensure correct types for the return value:

  • JavaScript SDK - Use the Evaluation class in the @autoblocks/client/testing package
  • Python SDK - Use the Evaluation dataclass in the autoblocks.testing.models package

Example using the JavaScript SDK:

import type { Evaluation } from "@autoblocks/client/testing";

// implement your evaluation logic here

const evaluation: Evaluation = {
  score: 1,
  threshold: {
    gte: 1,
  },
};

Best Practices

  1. Error Handling

    • Implement proper error handling
    • Return meaningful error messages
    • Log evaluation failures
  2. Performance

    • Keep evaluation logic efficient
    • Handle timeouts appropriately
    • Cache expensive computations
  3. Testing

    • Test with various scenarios
    • Verify edge cases
    • Monitor evaluation consistency
  4. Security

    • Secure your endpoint with authentication headers (see the sketch after this list)
    • Validate incoming requests
    • Use environment variables for sensitive data
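
As a minimal sketch of the authentication point above, you might require a shared-secret header at the top of the handler. The header name and environment variable below are illustrative, not part of any required contract (Deno.env is available on Val Town):

// Returns true only when the request carries the expected shared secret
function isAuthorized(request: Request): boolean {
  const expected = Deno.env.get("EVALUATOR_WEBHOOK_SECRET"); // illustrative variable name
  const provided = request.headers.get("x-webhook-secret"); // illustrative header name
  return Boolean(expected) && provided === expected;
}

// Usage inside the handler, before parsing the body:
// if (!isAuthorized(request)) {
//   return Response.json({ message: "Unauthorized." }, { status: 401 });
// }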