Evaluators
Learn how to create and use LLM evaluators to assess your AI agent’s performance
Prerequisites
Before using evaluators, you must configure your OpenAI API key in the settings. This key is required for the LLM-based evaluation functionality.
LLM as a Judge
What is an LLM Evaluator?
An LLM evaluator uses a large language model to assess your agent’s performance by analyzing conversation transcripts. The evaluator:
- Reviews the entire conversation
- Evaluates against specified criteria
- Provides a pass/fail result
- Explains the reasoning behind its decision
Creating an Evaluator
Define Success Criteria
Clearly specify what constitutes a successful interaction. For example:
- “The agent should confirm the appointment date and time”
- “The agent must verify the caller’s name”
- “The agent should handle interruptions politely”
- “The agent must not share sensitive information”
Example Evaluation Criteria
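Criteria are written in plain language. For instance, an appointment-booking agent might be evaluated against something like the following (illustrative only; adapt it to your own use case):

```text
The agent should:
1. Greet the caller and verify their name
2. Confirm the appointment date and time back to the caller
3. Offer an alternative slot if the requested time is unavailable
4. End the call politely once the appointment is confirmed
```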
Best Practices
Writing Effective Criteria
- Be Specific
  - Use clear, measurable objectives
  - Avoid ambiguous language
  - Include specific requirements
- Focus on Key Behaviors
  - Identify critical success factors
  - Prioritize important interactions
  - Define must-have elements
- Consider Edge Cases
  - Include criteria for handling interruptions
  - Address potential misunderstandings
  - Cover error scenarios
Example Scenarios
Basic Appointment Confirmation
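For a simple confirmation flow, the criteria can be as short as this (illustrative):

```text
The agent must confirm the requested appointment date and time,
verify the caller's name, and close the call politely.
```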
Complex Medical Scheduling
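A more involved scenario can layer several must-have behaviors (again illustrative):

```text
The agent must verify the patient's date of birth before discussing any details,
collect the referring physician's name, offer at least two available time slots,
and must not disclose any test results over the phone.
```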
Understanding Results
Evaluation Output
The LLM evaluator provides:
- A pass/fail status
- A reason explaining the decision
Example Output
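A result looks roughly like this (the exact field names depend on how results are surfaced in your workspace):

```json
{
  "passed": false,
  "reason": "The agent scheduled the appointment but never verified the caller's name, which the criteria require."
}
```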
Tips for Success
- Iterate on Criteria
  - Start with basic requirements
  - Test with different scenarios
  - Refine based on results
- Balance Strictness
  - Set reasonable expectations
  - Account for natural conversation flow
  - Consider multiple valid approaches
- Review and Adjust
  - Monitor evaluation results
  - Identify patterns in failures
  - Update criteria as needed
Webhook Evaluators
What is a Webhook Evaluator?
A webhook evaluator lets you implement custom evaluation logic by hosting your own evaluation endpoint. This gives you complete control over the evaluation process and supports complex, domain-specific evaluation criteria.
Webhook Payload Structure
The webhook will receive a JSON payload containing:
- Input details about the scenario, persona, and data fields
- Output containing the conversation transcript
Example payload:
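The sketch below shows the general shape described above; the exact field names may differ depending on your scenario configuration:

```json
{
  "input": {
    "scenario": "Caller wants to book a dental cleaning for next Tuesday at 2 PM",
    "persona": "Polite but easily distracted caller",
    "fields": {
      "callerName": "Jane Doe",
      "appointmentType": "cleaning"
    }
  },
  "output": {
    "transcript": [
      { "role": "agent", "message": "Thanks for calling. How can I help you today?" },
      { "role": "user", "message": "I'd like to book a cleaning for next Tuesday at 2 PM." },
      { "role": "agent", "message": "Of course. Can I confirm your name, please?" }
    ]
  }
}
```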
Implementing a Webhook Evaluator
You can host your webhook evaluator using services like Val Town, which provides a simple way to deploy and run JavaScript functions as webhooks.
Example implementation using Val Town:
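A minimal HTTP val might look like this. The payload and response field names are assumptions based on the structure above; use the SDK types described in the next section for the authoritative return shape:

```ts
// Val Town HTTP val: receives the evaluator webhook and returns an evaluation.
export default async function handler(req: Request): Promise<Response> {
  if (req.method !== "POST") {
    return new Response("Method not allowed", { status: 405 });
  }

  const payload = await req.json();
  const transcript: { role: string; message: string }[] =
    payload?.output?.transcript ?? [];

  // Example custom logic: pass only if the agent stated an appointment time.
  const agentText = transcript
    .filter((turn) => turn.role === "agent")
    .map((turn) => turn.message)
    .join(" ")
    .toLowerCase();
  const confirmedTime = /\b\d{1,2}(:\d{2})?\s?(am|pm)\b/.test(agentText);

  const evaluation = {
    score: confirmedTime ? 1 : 0,
    metadata: {
      reason: confirmedTime
        ? "The agent confirmed an appointment time."
        : "The agent never confirmed an appointment time.",
    },
  };

  return new Response(JSON.stringify(evaluation), {
    headers: { "Content-Type": "application/json" },
  });
}
```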
Using SDKs for Types
You can use our official SDKs to ensure correct types for the return value:
- JavaScript SDK - Use the `Evaluation` class in the `@autoblocks/client/testing` package
- Python SDK - Use the `Evaluation` dataclass in the `autoblocks.testing.models` package
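For example, in TypeScript the return value can be typed against the SDK (the score/metadata fields shown here are assumptions; defer to the package's exported types):

```ts
import { Evaluation } from "@autoblocks/client/testing";

// Map a pass/fail decision onto the SDK's Evaluation shape.
function toEvaluation(passed: boolean, reason: string): Evaluation {
  return {
    score: passed ? 1 : 0,
    metadata: { reason },
  };
}
```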
Best Practices
- Error Handling
  - Implement proper error handling
  - Return meaningful error messages
  - Log evaluation failures
- Performance
  - Keep evaluation logic efficient
  - Handle timeouts appropriately
  - Cache expensive computations
- Testing
  - Test with various scenarios
  - Verify edge cases
  - Monitor evaluation consistency
- Security (see the sketch after this list)
  - Secure your endpoint with authentication headers
  - Validate incoming requests
  - Use environment variables for sensitive data
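For instance, the webhook can reject requests that don't carry a shared secret before doing any work (the header name and environment variable below are illustrative):

```ts
// Illustrative auth and validation check at the top of the webhook handler.
export default async function handler(req: Request): Promise<Response> {
  const expected = Deno.env.get("EVALUATOR_WEBHOOK_SECRET");
  const provided = req.headers.get("x-webhook-secret");
  if (!expected || provided !== expected) {
    return new Response("Unauthorized", { status: 401 });
  }

  const payload = await req.json().catch(() => null);
  if (!payload?.output?.transcript) {
    return new Response("Invalid payload", { status: 400 });
  }

  // ...evaluation logic from the example above...
  return new Response(JSON.stringify({ score: 1 }), {
    headers: { "Content-Type": "application/json" },
  });
}
```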