Prerequisites

Before using evaluators, configure your OpenAI API key in the settings. LLM-based evaluation requires this key.

LLM as a Judge

What is an LLM Evaluator?

An LLM evaluator uses a large language model to assess your agent’s performance by analyzing conversation transcripts. The evaluator:
  • Reviews the entire conversation
  • Evaluates against specified criteria
  • Provides a pass/fail result
  • Explains the reasoning behind its decision

Creating an Evaluator

Define Success Criteria

Clearly specify what constitutes a successful interaction. For example:
  • “The agent should confirm the appointment date and time”
  • “The agent must verify the caller’s name”
  • “The agent should handle interruptions politely”
  • “The agent must not share sensitive information”

Example Evaluation Criteria

{
    "name": "Appointment Confirmation Check",
    "criteria": [
        "Agent must confirm the appointment date",
        "Agent must verify patient identity",
        "Agent should maintain professional tone",
        "Agent must handle any scheduling conflicts appropriately"
    ]
}

Best Practices

Writing Effective Criteria

  1. Be Specific
    • Use clear, measurable objectives
    • Avoid ambiguous language
    • Include specific requirements
  2. Focus on Key Behaviors
    • Identify critical success factors
    • Prioritize important interactions
    • Define must-have elements
  3. Consider Edge Cases
    • Include criteria for handling interruptions
    • Address potential misunderstandings
    • Cover error scenarios

Example Scenarios

Basic Appointment Confirmation
{
    "name": "Basic Confirmation",
    "criteria": [
        "Verify appointment date and time",
        "Confirm patient name",
        "End call professionally"
    ]
}
Complex Medical Scheduling
{
    "name": "Medical Scheduling",
    "criteria": [
        "Verify patient identity securely",
        "Confirm appointment type and duration",
        "Check insurance information",
        "Handle scheduling conflicts",
        "Provide preparation instructions"
    ]
}
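
The scenario definitions above share one shape: a name plus a list of criteria strings. The following is a minimal TypeScript sketch of that shape. The `EvaluatorDefinition` type and the `validateEvaluator` helper are hypothetical names used for illustration and are not part of any Autoblocks SDK. The helper only catches mechanical problems (empty or duplicated criteria); it does not judge whether a criterion is specific enough.

```typescript
// Shape inferred from the JSON examples above; the type name is illustrative.
interface EvaluatorDefinition {
  name: string;
  criteria: string[];
}

// Hypothetical helper: returns a list of problems, empty when none are found.
function validateEvaluator(def: EvaluatorDefinition): string[] {
  const problems: string[] = [];
  if (def.name.trim() === "") {
    problems.push("name is empty");
  }
  if (def.criteria.length === 0) {
    problems.push("no criteria defined");
  }
  // Track normalized criteria so near-identical entries are reported once.
  const seen: string[] = [];
  for (const criterion of def.criteria) {
    const key = criterion.trim().toLowerCase();
    if (key === "") {
      problems.push("empty criterion");
    } else if (seen.indexOf(key) !== -1) {
      problems.push(`duplicate criterion: ${criterion}`);
    } else {
      seen.push(key);
    }
  }
  return problems;
}

const basicConfirmation: EvaluatorDefinition = {
  name: "Basic Confirmation",
  criteria: [
    "Verify appointment date and time",
    "Confirm patient name",
    "End call professionally",
  ],
};

const basicProblems = validateEvaluator(basicConfirmation);
```

Running a check like this before saving a definition is one way to keep criteria lists short and free of accidental repeats.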

Understanding Results

Evaluation Output

The LLM evaluator provides:
  • A pass/fail status
  • A reason explaining the decision

Example Output

{
    "status": "fail",
    "reason": "The agent failed to verify the patient's identity before sharing appointment details"
}
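
If you collect several outputs of this form, a small helper can aggregate them so that recurring failure reasons stand out. The sketch below assumes the two fields shown above (`status` and `reason`) and that `status` is either `"pass"` or `"fail"`. The `summarizeResults` function and both type names are hypothetical, and the sample data is made up for illustration.

```typescript
// Shape inferred from the example output above.
interface EvaluationResult {
  status: "pass" | "fail";
  reason: string;
}

interface ResultSummary {
  total: number;
  passed: number;
  passRate: number;
  failureReasons: string[];
}

// Hypothetical helper: counts passes and collects the reasons given for failures.
function summarizeResults(results: EvaluationResult[]): ResultSummary {
  const failures = results.filter((r) => r.status === "fail");
  const passed = results.length - failures.length;
  return {
    total: results.length,
    passed,
    // Report 0 for an empty list instead of dividing by zero.
    passRate: results.length === 0 ? 0 : passed / results.length,
    failureReasons: failures.map((r) => r.reason),
  };
}

// Sample data for illustration only.
const summary = summarizeResults([
  { status: "pass", reason: "The agent confirmed the appointment date and time" },
  {
    status: "fail",
    reason: "The agent failed to verify the patient's identity before sharing appointment details",
  },
]);
```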

Tips for Success

  1. Iterate on Criteria
    • Start with basic requirements
    • Test with different scenarios
    • Refine based on results
  2. Balance Strictness
    • Set reasonable expectations
    • Account for natural conversation flow
    • Consider multiple valid approaches
  3. Review and Adjust
    • Monitor evaluation results
    • Identify patterns in failures
    • Update criteria as needed

Webhook Evaluators

What is a Webhook Evaluator?

A webhook evaluator allows you to implement custom evaluation logic by hosting your own evaluation endpoint. This gives you complete control over the evaluation process and allows for complex, domain-specific evaluation criteria.

Webhook Payload Structure

The webhook will receive a JSON payload containing:
  • Input details about the scenario, persona, and data fields
  • Output containing the conversation transcript
Example payload:
{
  "input": {
    "scenarioName": "Patient Appointment Confirmation",
    "scenarioDescription": "Test how well the agent confirms patient identity and appointment details",
    "personaName": "Impatient Caller",
    "personaDescription": "In a hurry, frequently interrupts, and expresses urgency throughout the call.",
    "edgeCases": [],
    "dataFields": [
      {
        "name": "firstName",
        "description": "Patient's first name",
        "example": "John"
      },
      {
        "name": "lastName",
        "description": "Patient's last name",
        "example": "Doe"
      },
      {
        "name": "dateOfBirth",
        "description": "Patient's date of birth",
        "example": "1990-01-01"
      },
      {
        "name": "appointmentTime",
        "description": "Scheduled appointment time",
        "example": "2024-03-27T14:30:00Z"
      }
    ]
  },
  "output": {
    "messages": [
      {
        "id": "item_BPE8uM8EiTE44GBo6Eewd",
        "timestamp": "2025-04-22T20:04:25.051Z",
        "role": "user",
        "roleLabel": "Your Agent",
        "content": "Hello, how are you?"
      },
      // ... conversation messages ...
    ]
  }
}
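
For a typed webhook handler, the payload can be described with a few interfaces. The sketch below takes its field names from the example payload above. Which fields are always present, and the element type of `edgeCases`, are assumptions, since the example only shows an empty array. The `isWebhookPayload` guard and the `formatTranscript` helper are hypothetical and check only what is needed to read the transcript.

```typescript
interface DataField {
  name: string;
  description: string;
  example: string;
}

interface WebhookInput {
  scenarioName: string;
  scenarioDescription: string;
  personaName: string;
  personaDescription: string;
  edgeCases: unknown[]; // element type is not shown in the example payload
  dataFields: DataField[];
}

interface TranscriptMessage {
  id: string;
  timestamp: string;
  role: string;
  roleLabel: string;
  content: string;
}

interface WebhookPayload {
  input: WebhookInput;
  output: { messages: TranscriptMessage[] };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Hypothetical guard: verifies the top-level structure and the message fields used below.
function isWebhookPayload(value: unknown): value is WebhookPayload {
  if (!isRecord(value)) return false;
  const input = value.input;
  const output = value.output;
  if (!isRecord(input) || !isRecord(output)) return false;
  const messages = output.messages;
  return Array.isArray(messages) && messages.every(
    (m) => isRecord(m) && typeof m.roleLabel === "string" && typeof m.content === "string",
  );
}

// Renders the conversation as one "Label: text" line per message.
function formatTranscript(payload: WebhookPayload): string {
  return payload.output.messages
    .map((m) => `${m.roleLabel}: ${m.content}`)
    .join("\n");
}
```

Validating the body with a guard like this before evaluating it lets the endpoint reject malformed requests early.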

Implementing a Webhook Evaluator

You can host your webhook evaluator using services like Val Town, which provides a simple way to deploy and run JavaScript functions as webhooks.
Example implementation using Val Town:
// Example webhook evaluator hosted on Val Town
// https://www.val.town/x/autoblocks/Autoblocks_Webhook_Evaluator

import { Evaluation } from "npm:@autoblocks/client/testing";

export default async function httpHandler(request: Request): Promise<Response> {
  if (request.method !== "POST") {
    // 405 Method Not Allowed is the standard status for an unsupported HTTP method
    return Response.json({ message: "Invalid method." }, {
      status: 405,
    });
  }
  try {
    const body = await request.json();

    // Analyze the messages in the body
    // Create an evaluation object

    const response: Evaluation = {
      // Add your score between 0 and 1
      score: 1,
      // Use threshold to determine if the evaluation passed or failed.
      threshold: {
        gte: 1,
      },
      metadata: {
        reason: "Add in your reason here.",
      },
    };
    return Response.json(response);
  } catch (e) {
    return Response.json({ message: "Could not evaluate." }, {
      status: 500,
    });
  }
}
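
The handler above leaves the evaluation logic open. As one possible way to fill it in, the sketch below scores a transcript by the fraction of required phrases it contains. This is a deliberately simple, hypothetical heuristic, not the method used by Autoblocks. `EvaluationLike` is a local stand-in that mirrors the `score`, `threshold`, and `metadata` fields used in the example, so the snippet runs without the SDK installed.

```typescript
// Local stand-in for the fields used in the handler example above.
interface EvaluationLike {
  score: number;
  threshold: { gte: number };
  metadata: { reason: string };
}

// Hypothetical scoring rule: share of required phrases found anywhere in the transcript.
function scoreTranscript(
  messages: { content: string }[],
  requiredPhrases: string[],
): EvaluationLike {
  const transcript = messages.map((m) => m.content.toLowerCase()).join("\n");
  const missing = requiredPhrases.filter(
    (phrase) => transcript.indexOf(phrase.toLowerCase()) === -1,
  );
  const found = requiredPhrases.length - missing.length;
  return {
    // With no required phrases there is nothing to miss, so the score is 1.
    score: requiredPhrases.length === 0 ? 1 : found / requiredPhrases.length,
    // Pass only when every required phrase is present.
    threshold: { gte: 1 },
    metadata: {
      reason: missing.length === 0
        ? "All required phrases were found."
        : `Missing phrases: ${missing.join(", ")}`,
    },
  };
}
```

Inside the handler, the object returned by `scoreTranscript(body.output.messages, [...])` could be passed to `Response.json` in place of the hard-coded response. Phrase matching is brittle for natural conversation, so treat it as a starting point.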

Using SDKs for Types

You can use our official SDKs to ensure correct types for the return value:
  • JavaScript SDK - Use the Evaluation class in the @autoblocks/client/testing package
  • Python SDK - Use the Evaluation dataclass in the autoblocks.testing.models package
Example using the JavaScript SDK:
import type { Evaluation } from "@autoblocks/client/testing";

// implement your evaluation logic here

const evaluation: Evaluation = {
  score: 1,
  threshold: {
    gte: 1,
  },
};

Best Practices

  1. Error Handling
    • Implement proper error handling
    • Return meaningful error messages
    • Log evaluation failures
  2. Performance
    • Keep evaluation logic efficient
    • Handle timeouts appropriately
    • Cache expensive computations
  3. Testing
    • Test with various scenarios
    • Verify edge cases
    • Monitor evaluation consistency
  4. Security
    • Secure your endpoint with authentication headers
    • Validate incoming requests
    • Use environment variables for sensitive data
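
As an illustration of the security points above, the sketch below checks a shared secret sent in an `Authorization` header. The `Bearer <secret>` scheme, the function names, and the idea that the secret is supplied by the caller are all assumptions. Use whatever header your webhook configuration actually sends, and load the expected secret from an environment variable instead of hard-coding it.

```typescript
// Compares two strings without returning early on the first mismatch.
// Where your runtime offers a timing-safe comparison in its crypto library, prefer that.
function safeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` turns that into 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

// Hypothetical header scheme: "Authorization: Bearer <secret>".
function isAuthorized(
  authorizationHeader: string | null,
  expectedSecret: string,
): boolean {
  // Reject missing headers, and refuse to run with an unset secret.
  if (authorizationHeader === null || expectedSecret === "") {
    return false;
  }
  const prefix = "Bearer ";
  if (authorizationHeader.slice(0, prefix.length) !== prefix) {
    return false;
  }
  return safeEqual(authorizationHeader.slice(prefix.length), expectedSecret);
}
```

In a handler like the one shown earlier, you could call `isAuthorized(request.headers.get("authorization"), secret)` before parsing the body and return a 401 response when it is false.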