Understanding Results Structure

Executions vs Test Runs

  • Executions: Individual simulation runs with their specific results
  • Test Runs: Groups of executions bundled together for comparison over time
    • Compare performance across different scenarios
    • Track improvements between iterations
    • Analyze patterns across multiple runs
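The relationship above can be sketched in a few lines: executions carry their own results, and a test run is just the bundle of executions sharing a run identifier. Field names like run_id and passed are illustrative assumptions, not the platform's actual schema.

```python
# Hypothetical sketch: bundling individual executions into test runs.
# "run_id" and "passed" are assumed field names for illustration only.
from collections import defaultdict

executions = [
    {"run_id": "run-1", "passed": True},
    {"run_id": "run-1", "passed": False},
    {"run_id": "run-2", "passed": True},
]

# Group executions by the test run they belong to.
test_runs = defaultdict(list)
for execution in executions:
    test_runs[execution["run_id"]].append(execution)

# A per-run pass rate is what makes runs comparable across iterations.
pass_rates = {
    run_id: sum(e["passed"] for e in execs) / len(execs)
    for run_id, execs in test_runs.items()
}
```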

Reviewing Executions

Execution Details

Each execution provides detailed information about:

  • Timestamp of the run
  • Duration of the conversation
  • Tokens used
  • Input/Output pairs
  • Pass/Fail status
  • Evaluation results
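One way to picture an execution record is as a single object carrying the fields listed above. This is a minimal sketch; the attribute names and types are assumptions, not the product's real data model.

```python
# Hypothetical execution record mirroring the fields listed above.
# All names and types here are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Execution:
    timestamp: str                 # when the run started
    duration_ms: int               # length of the conversation
    tokens_used: int               # total tokens consumed
    input_text: str                # what the simulated user said
    output_text: str               # what the agent replied
    passed: bool                   # overall pass/fail status
    evaluations: dict = field(default_factory=dict)  # per-evaluator results

ex = Execution(
    timestamp="2024-01-01T12:00:00Z",
    duration_ms=4200,
    tokens_used=350,
    input_text="What are your opening hours?",
    output_text="We are open 9am to 5pm.",
    passed=True,
    evaluations={"politeness": "pass"},
)
```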

Transcript Review

Review conversations in detail with:

  • Complete conversation transcript
  • Audio playback of the interaction
  • Turn-by-turn message analysis

Performance Metrics

Track important metrics including:

  • Response times
  • Token usage
  • Success rates
  • Evaluation results
  • Overall pass rates
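The aggregate metrics above can be derived directly from a list of execution records. A rough sketch, again assuming hypothetical field names (passed, duration_ms, tokens) rather than the platform's schema:

```python
# Sketch of aggregate metrics computed over a batch of executions.
# Field names are assumptions for illustration.
def summarize(executions):
    n = len(executions)
    return {
        "success_rate": sum(e["passed"] for e in executions) / n,
        "avg_duration_ms": sum(e["duration_ms"] for e in executions) / n,
        "total_tokens": sum(e["tokens"] for e in executions),
    }

metrics = summarize([
    {"passed": True, "duration_ms": 120, "tokens": 900},
    {"passed": False, "duration_ms": 480, "tokens": 1500},
])
```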

Search Syntax

Use powerful search operators to find specific executions:

Content Search:

  • output:text - Find executions where the output contains "text"
  • input:text - Find executions where the input contains "text"

Metric Filters:

  • duration>100 - Executions longer than 100ms
  • duration<500 - Executions shorter than 500ms
  • tokens>1000 - Executions using more than 1000 tokens

Combining Searches:

  • Use AND to combine conditions: duration>100 AND output:pass
  • Use OR for alternatives: output:hello OR output:hi
  • Use parentheses for complex queries: (duration>100 AND output:pass) OR input:complete
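To make the semantics of these operators concrete, here is a simplified evaluator for the syntax above. It handles content searches, metric filters, AND, and OR (with AND binding tighter than OR); parentheses are omitted for brevity. This is an illustrative sketch of how such queries could be interpreted, not the platform's actual search implementation.

```python
import re

def matches(execution, condition):
    """Evaluate one condition, e.g. output:text or duration>100."""
    m = re.fullmatch(r"(output|input):(.+)", condition)
    if m:
        field_name, text = m.groups()
        return text in execution[field_name]
    m = re.fullmatch(r"(duration|tokens)([<>])(\d+)", condition)
    if m:
        field_name, op, value = m.groups()
        actual = execution[field_name]
        return actual > int(value) if op == ">" else actual < int(value)
    raise ValueError(f"unrecognized condition: {condition}")

def search(execution, query):
    """Split on OR first, then AND, so AND binds tighter than OR."""
    return any(
        all(matches(execution, cond.strip()) for cond in clause.split(" AND "))
        for clause in query.split(" OR ")
    )
```

For example, against an execution with output "pass", duration 150, and 800 tokens, the query duration>100 AND output:pass matches, while tokens>1000 does not.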

Visualization Tools

Timeline View

  • Visual representation of execution timing
  • Identify patterns in response times
  • Spot anomalies or performance issues
  • Track conversation flow

Performance Graphs

  • Success rate trends
  • Duration distribution
  • Token usage patterns
  • Data capture accuracy over time

Comparison Tools

Compare executions across:

  • Different personas
  • Time periods
  • Edge cases
  • Data field variations

Best Practices

Analysis Workflow

  1. Review Overall Metrics

    • Check success rates
    • Analyze duration patterns
    • Review token usage
  2. Deep Dive into Failures

    • Examine failed executions
    • Review error patterns
    • Identify common issues
  3. Compare Across Runs

    • Track improvements
    • Identify regressions
    • Analyze pattern changes
  4. Document Findings

    • Note successful strategies
    • Document areas for improvement
    • Track action items
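Step 2 of this workflow, surfacing common issues among failures, can be sketched as a small triage helper. The passed and error fields are hypothetical, used only to show the idea of counting recurring failure patterns.

```python
# Illustrative triage for "Deep Dive into Failures": count the most
# common error labels among failed executions. Field names are assumed.
from collections import Counter

def common_failures(executions, top=3):
    failures = [e for e in executions if not e["passed"]]
    return Counter(e["error"] for e in failures).most_common(top)

patterns = common_failures([
    {"passed": False, "error": "timeout"},
    {"passed": False, "error": "timeout"},
    {"passed": True,  "error": None},
    {"passed": False, "error": "wrong_intent"},
])
```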

Tips for Effective Analysis

  • Start with high-level metrics
  • Use search to find specific patterns
  • Compare similar scenarios
  • Track improvements over time
  • Document unusual cases
  • Share insights with your team