Evaluators
Learn how to create and use LLM evaluators to assess your AI agent’s performance
Prerequisites
Before using evaluators, configure your OpenAI API key in the settings; the key is required for LLM-based evaluation.
LLM as a Judge
What is an LLM Evaluator?
An LLM evaluator uses a large language model to assess your agent’s performance by analyzing conversation transcripts (a minimal sketch of the flow follows the list below). The evaluator:
- Reviews the entire conversation
- Evaluates against specified criteria
- Provides a pass/fail result
- Explains the reasoning behind its decision
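Conceptually, the evaluation is a single judgment call to the model. The sketch below illustrates the idea with the OpenAI Python SDK; the model name, the `evaluate_transcript` helper, the prompt wording, and the `passed`/`reason` JSON fields are illustrative assumptions, not the platform’s actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evaluate_transcript(transcript: str, criteria: str) -> dict:
    """Ask an LLM to judge a conversation transcript against success criteria."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable chat model works
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an evaluator. Read the conversation transcript and decide "
                    "whether it satisfies ALL of the success criteria. "
                    'Reply as JSON: {"passed": true|false, "reason": "..."}'
                ),
            },
            {
                "role": "user",
                "content": f"Criteria:\n{criteria}\n\nTranscript:\n{transcript}",
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```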
Creating an Evaluator
Define Success Criteria
Clearly specify what constitutes a successful interaction. For example:
- “The agent should confirm the appointment date and time”
- “The agent must verify the caller’s name”
- “The agent should handle interruptions politely”
- “The agent must not share sensitive information”
Example Evaluation Criteria
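For instance, criteria for an appointment-booking agent could be expressed as a single block of instructions and passed to the evaluator. The wording below is illustrative only, not a prescribed format.

```python
# Illustrative criteria block for an appointment-booking agent (not a prescribed format).
criteria = """
The agent must:
1. Greet the caller and state the business name.
2. Verify the caller's full name.
3. Confirm the appointment date and time back to the caller.
4. End the call politely without sharing any other customer's information.
"""
```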
Best Practices
Writing Effective Criteria
- Be Specific
  - Use clear, measurable objectives
  - Avoid ambiguous language
  - Include specific requirements
- Focus on Key Behaviors
  - Identify critical success factors
  - Prioritize important interactions
  - Define must-have elements
- Consider Edge Cases
  - Include criteria for handling interruptions
  - Address potential misunderstandings
  - Cover error scenarios
Example Scenarios
Basic Appointment Confirmation
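As an illustration, a basic confirmation scenario could pair a short transcript with one or two of the criteria above. The snippet reuses the hypothetical `evaluate_transcript` helper from the earlier sketch; the transcript and criterion wording are assumptions.

```python
# Illustrative transcript; evaluate_transcript is the hypothetical helper sketched above.
transcript = (
    "Agent: Thanks for calling. How can I help?\n"
    "Caller: I'd like to confirm my appointment.\n"
    "Agent: Of course. May I have your name?\n"
    "Caller: Jane Doe.\n"
    "Agent: Thanks, Jane. You're booked for Tuesday at 3 PM. Anything else?\n"
    "Caller: No, that's all. Thank you.\n"
)

result = evaluate_transcript(
    transcript,
    criteria="The agent must verify the caller's name and confirm the appointment date and time.",
)
print(result)  # e.g. {"passed": True, "reason": "Name verified; Tuesday 3 PM confirmed."}
```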
Complex Medical Scheduling
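A medical-scheduling scenario typically layers more criteria, including the edge cases discussed under Best Practices. The criteria below are again illustrative.

```python
# Illustrative criteria for a more demanding medical-scheduling scenario.
medical_criteria = """
The agent must:
1. Verify the caller's full name and date of birth before discussing the appointment.
2. Confirm the appointment date, time, and clinic location back to the caller.
3. Handle interruptions politely and return to the open question.
4. Never share test results or other sensitive information during the call.
"""

result = evaluate_transcript(transcript, criteria=medical_criteria)
```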
Understanding Results
Evaluation Output
The LLM evaluator provides:
- A pass/fail status
- A reason explaining the decision
Example Output
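The exact shape of the result depends on how you run the evaluator; conceptually it carries the two fields above. A failing result might look like the following (field names mirror the earlier sketch and are not a guaranteed schema):

```python
# Illustrative failing result; field names follow the sketch above, not a fixed schema.
{
    "passed": False,
    "reason": "The agent confirmed the appointment time but never verified the "
              "caller's name, which the criteria require.",
}
```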
Tips for Success
- Iterate on Criteria
  - Start with basic requirements
  - Test with different scenarios
  - Refine based on results
- Balance Strictness
  - Set reasonable expectations
  - Account for natural conversation flow
  - Consider multiple valid approaches
- Review and Adjust
  - Monitor evaluation results
  - Identify patterns in failures
  - Update criteria as needed