# Add dataset items
Source: https://docs.autoblocks.ai/api-reference/datasets/add-dataset-items

https://api-v2.autoblocks.ai/openapi post /apps/{appSlug}/datasets/{externalId}/items
Add items to a dataset


# Create a dataset
Source: https://docs.autoblocks.ai/api-reference/datasets/create-a-dataset

https://api-v2.autoblocks.ai/openapi post /apps/{appSlug}/datasets
Create a new dataset


# Delete a dataset
Source: https://docs.autoblocks.ai/api-reference/datasets/delete-a-dataset

https://api-v2.autoblocks.ai/openapi delete /apps/{appSlug}/datasets/{externalId}
Delete a dataset


# Delete dataset item
Source: https://docs.autoblocks.ai/api-reference/datasets/delete-dataset-item

https://api-v2.autoblocks.ai/openapi delete /apps/{appSlug}/datasets/{externalId}/items/{itemId}
Delete a dataset item


# Get dataset items
Source: https://docs.autoblocks.ai/api-reference/datasets/get-dataset-items

https://api-v2.autoblocks.ai/openapi get /apps/{appSlug}/datasets/{externalId}/items
Get items from a dataset


# Get dataset schema by version
Source: https://docs.autoblocks.ai/api-reference/datasets/get-dataset-schema-by-version

https://api-v2.autoblocks.ai/openapi get /apps/{appSlug}/datasets/{externalId}/schema/{schemaVersion}
Get a dataset schema by version


# List datasets
Source: https://docs.autoblocks.ai/api-reference/datasets/list-datasets

https://api-v2.autoblocks.ai/openapi get /apps/{appSlug}/datasets
List all datasets for an app


# Update dataset item
Source: https://docs.autoblocks.ai/api-reference/datasets/update-dataset-item

https://api-v2.autoblocks.ai/openapi put /apps/{appSlug}/datasets/{externalId}/items/{itemId}
Update a dataset item


# Get a specific job
Source: https://docs.autoblocks.ai/api-reference/human-review/get-a-specific-job

https://api-v2.autoblocks.ai/openapi get /apps/{appSlug}/human-review/jobs/{jobId}
Get a specific job by ID


# Get a specific job item
Source: https://docs.autoblocks.ai/api-reference/human-review/get-a-specific-job-item

https://api-v2.autoblocks.ai/openapi get /apps/{appSlug}/human-review/jobs/{jobId}/items/{itemId}
Get a specific job item by ID


# Get all jobs for the app
Source: https://docs.autoblocks.ai/api-reference/human-review/get-all-jobs-for-the-app

https://api-v2.autoblocks.ai/openapi get /apps/{appSlug}/human-review/jobs
Get all jobs for the app


# Log trace
Source: https://docs.autoblocks.ai/api-reference/otel/log-trace

https://api-v2.autoblocks.ai/openapi post /otel/v1/traces
Log a trace in the OpenTelemetry format. See https://docs.autoblocks.ai/v2/guides/tracing/overview for more information.


# Create a new prompt
Source: https://docs.autoblocks.ai/api-reference/prompts/create-a-new-prompt

https://api-v2.autoblocks.ai/openapi post /apps/{appSlug}/prompts
Create a new prompt inside of a Prompt app. This prompt can then be used with the Prompt SDKs. See https://docs.autoblocks.ai/v2/guides/prompt-management/overview for more information.


# Get a deployed prompt
Source: https://docs.autoblocks.ai/api-reference/prompts/get-a-deployed-prompt

https://api-v2.autoblocks.ai/openapi get /apps/{appId}/prompts/{externalId}/major/{majorVersion}/minor/{minorVersion}
This endpoint is used by the SDKs to get a deployed prompt inside a Prompt Manager. See https://docs.autoblocks.ai/v2/guides/prompt-management/overview for more information.


# Get an undeployed prompt
Source: https://docs.autoblocks.ai/api-reference/prompts/get-an-undeployed-prompt

https://api-v2.autoblocks.ai/openapi get /apps/{appId}/prompts/{externalId}/major/undeployed/minor/{revisionId}
This endpoint is used by the SDKs to get an undeployed prompt inside a Prompt Manager. See https://docs.autoblocks.ai/v2/guides/prompt-management/overview for more information.


# Get prompt types
Source: https://docs.autoblocks.ai/api-reference/prompts/get-prompt-types

https://api-v2.autoblocks.ai/openapi get /prompts/types
This endpoint is used by the SDKs to generate classes for prompts. See https://docs.autoblocks.ai/v2/guides/prompt-management/overview for more information.


# Valid prompt compatibility
Source: https://docs.autoblocks.ai/api-reference/prompts/valid-prompt-compatibility

https://api-v2.autoblocks.ai/openapi post /apps/{appId}/prompts/{externalId}/revisions/{revisionId}/validate
This endpoint is used by the SDKs to check if a prompt revision is compatible with their currently-configured major version. This is used in the context of tests triggered from the UI, where we override the local config with a prompt revision.


# Generate a message for a scenario
Source: https://docs.autoblocks.ai/api-reference/scenarios/generate-a-message-for-a-scenario

https://api-v2.autoblocks.ai/openapi post /apps/{appSlug}/scenarios/{scenarioId}/generate-message
Generate a message for a scenario based on conversation history


# Get all scenario IDs for the app
Source: https://docs.autoblocks.ai/api-reference/scenarios/get-all-scenario-ids-for-the-app

https://api-v2.autoblocks.ai/openapi get /apps/{appSlug}/scenarios
Get all scenario IDs for the app


# Create a CI build
Source: https://docs.autoblocks.ai/api-reference/testing/create-a-ci-build

https://api-v2.autoblocks.ai/openapi post /testing/builds
Used by the Autoblocks CLI to create a CI build. See https://docs.autoblocks.ai/v2/guides/testing/overview for more information.


# Cloud
Source: https://docs.autoblocks.ai/v2/deployment/cloud

Get started quickly with Autoblocks' hosted cloud offering, the recommended approach for most users.

# Cloud Deployment

Autoblocks offers a fully managed cloud deployment option, providing a quick and easy way to get started. This hosted solution is the recommended approach for most users, as it eliminates the need for infrastructure management and operational overhead.

## Key Benefits

* **Quick Setup**: Get started in minutes with our hosted solution.
* **Managed Infrastructure**: We handle all infrastructure management, updates, and maintenance.
* **Scalability**: Automatically scales based on your usage patterns.
* **High Availability**: Deployed across multiple availability zones for reliability.
* **Security**: Built-in security measures and compliance with industry standards.

## Getting Started

### 1. Sign Up

Visit the [Autoblocks website](https://app-v2.autoblocks.ai) to sign up for an account. The process is straightforward and requires minimal setup.

### 2. Configure Your Environment

Once signed up, you can configure your environment through the Autoblocks dashboard. This includes setting up your organization, projects, and initial configurations.

### 3. Integrate with Your Stack

Autoblocks seamlessly integrates with your existing technology stack. Follow the integration guides for [Python](/v2/tracing/python/quick-start) or [TypeScript](/v2/tracing/typescript/quick-start) to get started.

## Best Practices

* **Regular Monitoring**: Utilize the built-in monitoring tools to track performance and usage.
* **Security Compliance**: Ensure your usage aligns with security best practices and compliance requirements.
* **Scalability Planning**: Plan for scalability to accommodate growth in usage and data volume.

## Next Steps

* [Security and Compliance](/v2/deployment/security-and-compliance)
* [Self-Hosted Deployment](/v2/deployment/self-hosted) (for advanced users requiring full control over their infrastructure)


# Security and Compliance
Source: https://docs.autoblocks.ai/v2/deployment/security-and-compliance

At Autoblocks, security and compliance are top priorities. Autoblocks is committed to ensuring the safety of our customers' data by following industry-standard best practices.

# Security and Compliance at Autoblocks

At Autoblocks, security and compliance are top priorities. Autoblocks is committed to ensuring the safety of our customers' data by following industry-standard best practices.

## Secure Connections

Autoblocks requires that all connections use [SSL/TLS](https://aws.amazon.com/what-is/ssl-certificate) encryption to ensure the confidentiality and integrity of data transmitted between Autoblocks and our customers.

## Secure Hosting

Autoblocks's infrastructure is hosted and managed with Amazon's secure data centers backed by [AWS Cloud Security](https://aws.amazon.com/security/). Amazon continually manages risk and undergoes recurring assessments to ensure compliance with industry standards. For information about AWS data center compliance programs, refer to [AWS Compliance Programs](https://aws.amazon.com/compliance/programs).

Additionally, Autoblocks implements comprehensive edge security measures powered by [Cloudflare's](https://www.cloudflare.com/security) global network.

## Security Reporting

Autoblocks is built with security top of mind, however it is not possible to entirely exclude the existence of security vulnerabilities.

If you identify a security vulnerability, please send an email to [security@autoblocks.ai](mailto:security@autoblocks.ai) with the following details:

* A summary of the vulnerability
* Steps to reproduce the vulnerability
* Possible impact of the vulnerability
* If applicable, any code to exploit the vulnerability

Upon receipt of the report, the Autoblocks team will promptly evaluate and keep you updated on the progress towards a fix.

## SOC 2 Type 2 Compliance

Autoblocks is SOC 2 Type 2 compliant. You may request a compliance report by emailing [security@autoblocks.ai](mailto:security@autoblocks.ai).

## HIPAA Compliance

Autoblocks is HIPAA compliant. You may request a business associate agreement by emailing [security@autoblocks.ai](mailto:security@autoblocks.ai).

## High Availability

The Autoblocks platform uses a multi-availability zone architecture to ensure high availability and fault tolerance. This means that if one availability zone goes down, the platform will continue to operate from another availability zone.

## Disaster Recovery

Autoblocks keeps encrypted backups of all customer data. While never expected, in the event of production data loss or region-wide outage, Autoblocks will restore the latest backup to a new production environment.

## Data Retention

We retain your data based on your organization's subscription plan and in accordance with our [terms of service](https://www.autoblocks.ai/terms). You may request for data to be permanently deleted at any time by contacting [security@autoblocks.ai](mailto:security@autoblocks.ai).

## Data Security

Customer data is encrypted in transit and at rest using industry standard encryption algorithms.

## Access Monitoring

Autoblocks utilizes cloud-specific security tools to continuously monitor access to our infrastructure and applications.

## Questions

If you have any questions about Autoblocks security and compliance or would like a deeper dive into any aspect, please contact [security@autoblocks.ai](mailto:security@autoblocks.ai) and we will be happy to assist you.


# Self-Hosted
Source: https://docs.autoblocks.ai/v2/deployment/self-hosted

Learn how to deploy Autoblocks on your own infrastructure.

Autoblocks offers self-hosted deployments through our partnership with Omnistrate, enabling you to run Autoblocks in your own cloud environment and preferred region. With our bring your own account (BYOA) model, you maintain complete data sovereignty while we handle the operational complexity. Our deployment automatically scales based on your usage patterns, includes automated backups for disaster recovery, and maintains high availability across multiple availability zones—all without requiring any operational overhead from your teams. Our control plane is designed with security in mind, operating with limited, precisely-scoped access to only manage the resources required for your Autoblocks deployment.

## Video Walkthrough

<div
  style={{
position: 'relative',
paddingBottom: '56.25%',
height: 0,
}}
>
  <iframe
    src="https://www.loom.com/embed/6a5d9cede7404a919c95971589d7f964?sid=be77d442-9cf6-4b17-8de1-7cfa7ff1e9fb"
    frameborder="0"
    webkitallowfullscreen
    mozallowfullscreen
    allowfullscreen
    style={{
  position: 'absolute',
  top: 0,
  left: 0,
  width: '100%',
  height: '100%',
}}
  />
</div>

## Integration Security

Our integration follow security best practices with strictly limited permissions scoped to essential infrastructure components:

We maintain a focused set of permissions that only cover necessary AWS services: EC2, EKS, Elastic Load Balancing, VPC, and minimal IAM operations. For components like the AWS Load Balancer Controller, we implement standard open-source IAM policies from the official [Kubernetes SIG repository](https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/install/iam_policy.json).

To ensure these boundaries cannot be exceeded, we implement multiple layers of security controls:

1. A permissions boundary (`OmnistrateBootstrapPermissionsBoundary`) that prevents the creation of any IAM roles or policies beyond the initial permitted set
2. Resource tagging restrictions that limit `iam:PassRole` operations to only Autoblocks-managed roles - `'aws:ResourceTag/omnistrate.com/managed-by': 'omnistrate'`
3. The `OmnistrateInfrastructureProvisioningPolicy` includes explicit conditions preventing access to IAM policies outside our scope

While we can implement additional restrictions based on your security requirements, this may impact our ability to provide comprehensive support and maintenance. Our current permission set represents the optimal balance between security and operational efficiency.

## Deploying to your cloud account

### Connecting your cloud account

Visit the [BYOA portal](https://byoa.autoblocks.ai), create an account, and follow the instructions under [Cloud Accounts](https://byoa.autoblocks.ai/cloud-accounts) to connect your cloud provider account.

![connect-cloud-account](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/self-hosting/connect-cloud-account.png)

After connecting your could account, there will be a modal with a link to run the CloudFormation template to establish a secure connection between your cloud provider account and our control plane.

![connect-modal](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/self-hosting/connect-modal.png)

### Deploying Autoblocks

Once your cloud provider account is configured, you can visit the [Instances](https://byoa.autoblocks.ai/instances) page to deploy Autoblocks to the region of your choice.

![deploy-autoblocks](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/self-hosting/deploy-autoblocks.png)

### Environment Variables

Reach out to [us](mailto:support@autoblocks.ai) to get the environment variables for your deployment.

<Note>
  You will need to create a PostgreSQL database and ensure the deployment k8s cluster has access to it on port 5432.
  The database url environment variable follows the format `postgresql://<username>:<password>@<host>:<port>/<database_name>`.
</Note>

<Note>
  The WorkOS callback url should be in the format `https://<your-custom-webapp-domain>/api/auth/callback`. Where your custom webapp domain is the domain you plan to associate with the web application for your Autoblocks instance.
</Note>

### Custom Domain

Once you have deployed Autoblocks, you can modify the instance to add your custom domains. When you select your recently deployed instance on the [Instances](https://byoa.autoblocks.ai/instances) page, you will see a Custom DNS tab to add your custom domains.

<Note>
  Let [us](mailto:support@autoblocks.ai) know what your custom domain is so we can configure your WorkOS environment to use it.
</Note>

### Monitoring your deployment

You can monitor your deployment by visiting the [Instances](https://byoa.autoblocks.ai/instances) page and selecting your recently deployed instance.

You can also see audit logs of all actions taken on your deployment by visiting the [Audit Logs](https://byoa.autoblocks.ai/audit-logs) page.

### Backups & Recovery

The deployment will automatically create a backup of the database every 6 hours and retains backups for 5 days. You can request custom backup schedules by contacting [us](mailto:support@autoblocks.ai).

### Upgrades

Your deployment will automatically upgrade to the latest version of Autoblocks as we release new versions.

### Deleting your deployment

You can delete your deployment by visiting the [Instances](https://byoa.autoblocks.ai/instances) page and selecting your recently deployed instance.


# Analyze Results
Source: https://docs.autoblocks.ai/v2/guides/agent-simulate/analyze-results

Learn how to analyze and understand your simulation results

## Understanding Results Structure

### Executions vs Test Runs

* **Executions**: Individual simulation runs with their specific results
* **Test Runs**: Groups of executions bundled together for comparison over time
  * Compare performance across different scenarios
  * Track improvements between iterations
  * Analyze patterns across multiple runs

## Reviewing Executions

![review-executions](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/review-executions.png)

### Execution Details

Each execution provides detailed information about:

* Timestamp of the run
* Duration of the conversation
* Tokens used
* Input/Output pairs
* Pass/Fail status
* Evaluation results

### Transcript Review

Review conversations in detail with:

* Complete conversation transcript
* Audio playback of the interaction
* Turn-by-turn message analysis

### Performance Metrics

Track important metrics including:

* Response times
* Token usage
* Success rates
* Evaluation results
* Overall pass rates

## Advanced Search

### Search Syntax

Use powerful search operators to find specific executions:

**String Search:**

* `field:value` - Contains search (e.g. `source:aws`, `input:hello`)
* `field!:value` - Not contains search (e.g. `source!:aws`)
* `field=value` - Exact match (e.g. `source=aws`)
* `field!=value` - Not equals (e.g. `source!=aws`)
* `field is:empty` - Check for empty values

**Numeric Search:**

* `duration>100` - Greater than
* `duration<500` - Less than
* `duration>=100` - Greater than or equal
* `duration<=500` - Less than or equal
* `duration=100` - Exact match

**Free Text Search:**

* Simple text search (e.g. `hello`) - Searches across all text fields
* Use `AND` to combine terms (e.g. `hello AND world`)
* Use `OR` for alternatives (e.g. `hello OR world`)
* Use parentheses for grouping (e.g. `(hello OR world) AND test`)

**Combining Searches:**

* Mix and match different operators (e.g. `source:aws AND environment=prod`)
* Use parentheses for complex queries (e.g. `(duration>100 OR duration<=50) AND environment=prod`)
* Combine free text with specific field searches (e.g. `hello AND source:aws`)

**Quoted Strings:**

* Use quotes for multi-word values (e.g. `source:"aws lambda"`)

**Available Fields:**

* `source` - Source of the execution
* `environment` - Environment name
* `input` - Input text
* `output` - Output text
* `message` - Run message
* `duration` - Duration in milliseconds

## Visualization Tools

### Timeline View

* Visual representation of execution timing
* Identify patterns in response times
* Spot anomalies or performance issues
* Track conversation flow

### Performance Graphs

* Success rate trends
* Duration distribution
* Token usage patterns
* Data capture accuracy over time

### Comparison Tools

Compare executions across:

* Different personas
* Time periods
* Edge cases
* Data field variations

![test-runs](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/test-runs.png)

## Best Practices

### Analysis Workflow

1. **Review Overall Metrics**
   * Check success rates
   * Analyze duration patterns
   * Review token usage

2. **Deep Dive into Failures**
   * Examine failed executions
   * Review error patterns
   * Identify common issues

3. **Compare Across Runs**
   * Track improvements
   * Identify regressions
   * Analyze pattern changes

4. **Document Findings**
   * Note successful strategies
   * Document areas for improvement
   * Track action items

### Tips for Effective Analysis

* Start with high-level metrics
* Use search to find specific patterns
* Compare similar scenarios
* Track improvements over time
* Document unusual cases
* Share insights with team


# Evaluators
Source: https://docs.autoblocks.ai/v2/guides/agent-simulate/evaluators

Learn how to create and use LLM evaluators to assess your AI agent's performance

## Prerequisites

Before using evaluators, you must configure your OpenAI API key in the settings. This key is required for the LLM-based evaluation functionality.

## LLM as a Judge

### What is an LLM Evaluator?

An LLM evaluator uses a large language model to assess your agent's performance by analyzing conversation transcripts. The evaluator:

* Reviews the entire conversation
* Evaluates against specified criteria
* Provides a pass/fail result
* Explains the reasoning behind its decision

### Creating an Evaluator

#### Define Success Criteria

Clearly specify what constitutes a successful interaction. For example:

* "The agent should confirm the appointment date and time"
* "The agent must verify the caller's name"
* "The agent should handle interruptions politely"
* "The agent must not share sensitive information"

#### Example Evaluation Criteria

```json
{
    "name": "Appointment Confirmation Check",
    "criteria": [
        "Agent must confirm the appointment date",
        "Agent must verify patient identity",
        "Agent should maintain professional tone",
        "Agent must handle any scheduling conflicts appropriately"
    ]
}
```

### Best Practices

#### Writing Effective Criteria

1. **Be Specific**
   * Use clear, measurable objectives
   * Avoid ambiguous language
   * Include specific requirements

2. **Focus on Key Behaviors**
   * Identify critical success factors
   * Prioritize important interactions
   * Define must-have elements

3. **Consider Edge Cases**
   * Include criteria for handling interruptions
   * Address potential misunderstandings
   * Cover error scenarios

#### Example Scenarios

**Basic Appointment Confirmation**

```json
{
    "name": "Basic Confirmation",
    "criteria": [
        "Verify appointment date and time",
        "Confirm patient name",
        "End call professionally"
    ]
}
```

**Complex Medical Scheduling**

```json
{
    "name": "Medical Scheduling",
    "criteria": [
        "Verify patient identity securely",
        "Confirm appointment type and duration",
        "Check insurance information",
        "Handle scheduling conflicts",
        "Provide preparation instructions"
    ]
}
```

### Understanding Results

#### Evaluation Output

The LLM evaluator provides:

* A pass/fail status
* A reason explaining the decision

#### Example Output

```json
{
    "status": "fail",
    "reason": "The agent failed to verify the patient's identity before sharing appointment details"
}
```

### Tips for Success

1. **Iterate on Criteria**
   * Start with basic requirements
   * Test with different scenarios
   * Refine based on results

2. **Balance Strictness**
   * Set reasonable expectations
   * Account for natural conversation flow
   * Consider multiple valid approaches

3. **Review and Adjust**
   * Monitor evaluation results
   * Identify patterns in failures
   * Update criteria as needed

## Webhook Evaluators

### What is a Webhook Evaluator?

A webhook evaluator allows you to implement custom evaluation logic by hosting your own evaluation endpoint. This gives you complete control over the evaluation process and allows for complex, domain-specific evaluation criteria.

### Webhook Payload Structure

The webhook will receive a JSON payload containing:

* Input details about the scenario, persona, and data fields
* Output containing the conversation transcript

Example payload:

```json [expandable]
{
  "input": {
    "scenarioName": "Patient Appointment Confirmation",
    "scenarioDescription": "Test how well the agent confirms patient identity and appointment details",
    "personaName": "Impatient Caller",
    "personaDescription": "In a hurry, frequently interrupts, and expresses urgency throughout the call.",
    "edgeCases": [],
    "dataFields": [
      {
        "name": "firstName",
        "description": "Patient's first name",
        "example": "John"
      },
      {
        "name": "lastName",
        "description": "Patient's last name",
        "example": "Doe"
      },
      {
        "name": "dateOfBirth",
        "description": "Patient's date of birth",
        "example": "1990-01-01"
      },
      {
        "name": "appointmentTime",
        "description": "Scheduled appointment time",
        "example": "2024-03-27T14:30:00Z"
      }
    ]
  },
  "output": {
    "messages": [
      {
        "id": "item_BPE8uM8EiTE44GBo6Eewd",
        "timestamp": "2025-04-22T20:04:25.051Z",
        "role": "user",
        "roleLabel": "Your Agent",
        "content": "Hello, how are you?"
      },
      // ... conversation messages ...
    ]
  }
}
```

### Implementing a Webhook Evaluator

You can host your webhook evaluator using services like [Val Town](https://www.val.town/x/autoblocks/Autoblocks_Webhook_Evaluator), which provides a simple way to deploy and run JavaScript functions as webhooks.

Example implementation using Val Town:

```javascript
// Example webhook evaluator hosted on Val Town
// https://www.val.town/x/autoblocks/Autoblocks_Webhook_Evaluator

import { Evaluation } from "npm:@autoblocks/client/testing";

export default async function httpHandler(request: Request): Promise<Response> {
  if (request.method !== "POST") {
    return Response.json({ message: "Invalid method." }, {
      status: 400,
    });
  }
  try {
    const body = await request.json();

    // Analyze the messages in the body
    // Create an evaluation object

    const response: Evaluation = {
      // Add your score between 0 and 1
      score: 1,
      // Use threshold to detrmine if the evaluation passed or failed.
      threshold: {
        gte: 1,
      },
      metadata: {
        reason: "Add in your reason here.",
      },
    };
    return Response.json(response);
  } catch (e) {
    return Response.json({ message: "Could not evaluate." }, {
      status: 500,
    });
  }
}
```

### Using SDKs for Types

You can use our official SDKs to ensure correct types for the return value:

* [JavaScript SDK](https://github.com/autoblocksai/javascript-sdk) - Use the `Evaluation` class in the `@autoblocks/client/testing` package
* [Python SDK](https://github.com/autoblocksai/python-sdk) - Use the `Evaluation` dataclass in the `autoblocks.testing.models` package

<CodeGroup>
  ```typescript TypeScript
  import type { Evaluation } from "@autoblocks/client/testing";

  // implement your evaluation logic here

  const evaluation: Evaluation = {
    score: 1,
    threshold: {
      gte: 1,
    },
  };
  ```

  ```python Python
  from autoblocks.testing.models import Evaluation
  from autoblocks.testing.models import Threshold

  # implement your evaluation logic here

  evaluation = Evaluation(
    score=1,
    threshold=Threshold(gte=1),
  )
  ```
</CodeGroup>

### Best Practices

1. **Error Handling**
   * Implement proper error handling
   * Return meaningful error messages
   * Log evaluation failures

2. **Performance**
   * Keep evaluation logic efficient
   * Handle timeouts appropriately
   * Cache expensive computations

3. **Testing**
   * Test with various scenarios
   * Verify edge cases
   * Monitor evaluation consistency

4. **Security**
   * Secure your endpoint with authentication headers
   * Validate incoming requests
   * Use environment variables for sensitive data


# Frequently Asked Questions
Source: https://docs.autoblocks.ai/v2/guides/agent-simulate/faq


## Billing & Usage

**Q: How does billing work?**
A: You are only charged when you run a simulation. Each simulation run counts as one execution. We recommend starting with simple scenarios and gradually scaling up to understand costs.

**Q: How can I monitor my usage?**
A: You can view your usage details in the Settings page, including:

* Number of simulations run
* Total execution time
* Evaluator usage
* Current billing period stats

**Q: How can I control costs?**
A: We recommend:

1. Start with simple scenarios before scaling up
2. Use test runs to validate your setup
3. Monitor usage in Settings

## Scenarios & Configuration

**Q: How many personas should I include in a scenario?**
A: Start with 1-2 personas that represent your most common use cases. Add more personas once you're comfortable with the results.

**Q: What makes a good scenario description?**
A: The best scenario descriptions are specific and clear about what you're testing. For example: "Test if the agent can correctly schedule a follow-up appointment while verifying patient information."

**Q: Can I reuse personas across different scenarios?**
A: Yes! You can use the same personas across multiple scenarios to ensure consistent testing across different situations.

## Data Fields & Edge Cases

**Q: How many data field variants should I include?**
A: Start with 2-3 common variants for each field. For example, for dates: "March 15, 2023", "3/15/23", and "next Wednesday".

**Q: Which edge cases should I test first?**
A: Begin with common edge cases like:

* Interruptions
* Unclear responses
* Background noise
* Multiple questions at once

## Evaluators

**Q: Do I need an OpenAI API key to use evaluators?**
A: Yes, you need to configure your OpenAI API key in Settings before using evaluators.

**Q: What makes a good evaluation criteria?**
A: Good evaluation criteria should be:

* Specific and measurable
* Focused on one aspect of behavior
* Clear about what constitutes success

**Q: Why did my evaluation fail?**
A: Check that:

1. Your OpenAI API key is valid
2. Your criteria are clear and specific
3. The conversation transcript is complete

## Analysis & Results

**Q: Can I share simulation results with my team?**
A: Yes, you can share specific executions and test runs with team members who have access to your organization.

**Q: How can I search through my simulation results?**
A: Use our search syntax to find specific results:

* `output:text` - Find specific output content
* `duration>100` - Filter by duration

## Troubleshooting

**Q: My simulation isn't starting. What should I check?**
A: Verify that:

1. Your scenario is completely configured
2. All required fields are filled out
3. You have sufficient permissions
4. You're within your usage limits

**Q: The agent isn't responding as expected. What can I do?**
A: Try:

1. Reviewing your scenario description
2. Checking data field variants
3. Simplifying edge cases
4. Running a test simulation with basic settings

**Q: My evaluator keeps failing. How can I fix it?**
A: Ensure:

1. Your OpenAI API key is properly configured
2. Your evaluation criteria are clear and specific
3. The scenario is properly configured
4. You have sufficient API credits

## Getting Help

**Q: How can I get support?**
A: You can:

1. Check these docs for guidance
2. Email [support@autoblocks.ai](mailto:support@autoblocks.ai)
3. Request a shared slack channel with your team


# Overview
Source: https://docs.autoblocks.ai/v2/guides/agent-simulate/overview

Learn about Agent Simulate and how it can help test your AI agents

## What is Agent Simulate?

Agent Simulate is a powerful platform for testing and evaluating AI agents in realistic scenarios. It allows you to create simulated environments where your AI agents can interact with virtual personas, helping you understand how they perform in various situations before deploying them to production.

The platform focuses on three key aspects:

1. **Realistic Testing** - Create true-to-life scenarios with diverse personas
2. **Comprehensive Evaluation** - Use LLM-based evaluators to assess performance
3. **Detailed Analysis** - Review conversations, metrics, and trends

![Agent Simulate Overview](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/agent-simulate.png)

## Use Cases

### Voice & Chat Agents

* Appointment scheduling and confirmation
* Customer service interactions
* Information gathering and verification
* Complex multi-turn conversations

### Healthcare Applications

* Patient intake processes
* Appointment management
* Medical information verification
* Follow-up scheduling

### Customer Service

* Support ticket handling
* Product inquiries
* Complaint resolution
* Service scheduling

## Core Concepts

### Scenarios

A scenario is a structured test case that includes:

* Clear objective and success criteria
* One or more personas to interact with
* Data fields to collect or verify
* Edge cases to test robustness

### Personas

Virtual characters that interact with your agent, each with:

* Defined personality traits (e.g., elderly, impatient, distracted)
* Specific behavioral patterns
* Realistic conversation styles
* Common challenges they present

### Data Fields

Information your agent needs to collect or verify:

* Supports multiple input variants
* Handles different formats
* Tests comprehension flexibility
* Validates data capture accuracy

### Edge Cases

Challenging situations to test agent resilience:

* Interruptions and unclear responses
* Background noise and distractions
* Multiple questions at once
* Unexpected behaviors

### Evaluators

LLM-based assessment system that:

* Reviews conversation transcripts
* Provides pass/fail results
* Explains evaluation decisions
* Ensures consistent assessment

### Results Analysis

Comprehensive tools for reviewing performance:

* Full conversation transcripts
* Audio playback
* Success/failure metrics
* Search and filtering capabilities

## Getting Started

1. Create an Autoblocks account
2. Set up your first scenario
3. Configure personas and data fields
4. Add edge cases
5. Run your first simulation

Check our [Quickstart Guide](/v2/guides/agent-simulate/quickstart) for detailed setup instructions.


# Quickstart
Source: https://docs.autoblocks.ai/v2/guides/agent-simulate/quick-start

Run your first simulation in 5 minutes

## Prerequisites

Before you begin using Agent Simulate, you'll need:

1. An Autoblocks account
2. OpenAI API key (for evaluators)
3. Basic understanding of your agent's API endpoint

## Create Your First Simulation

### 1. Create a New Scenario

* Navigate to "Simulations" in the sidebar
* Click "New Scenario"
* Name your scenario (e.g., "Appointment Confirmation")
* Add a clear description of what you're testing

![new-scenario](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/new-scenario.png)

### 2. Configure Personas

* Click "Add Persona"
* Choose from pre-built options:
  * Elderly Caller
  * Impatient Caller
  * Distracted Caller
* Start with one persona for your first test

![personas](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/personas.png)

### 3. Set Up Data Fields

* Add fields your agent needs to collect
* For each field:
  * Provide a clear name
  * Add 2-3 common variants
  * Example for dates:
    * "March 15, 2023"
    * "3/15/23"
    * "next Wednesday"

![data-fields](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/data-fields.png)

### 4. Add Edge Cases

* Select relevant edge cases:
  * Interruptions
  * Background noise
  * Unclear responses
* Start with 1-2 edge cases for initial testing

![edge-cases](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/edge-cases.png)

### 5. Run Your Simulation

* Review your configuration
* Click "Test Scenario"
* Enter required information:
  * Phone number to call
  * Description of the test run
* Click "Run Simulation"

![agent-simulate](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/agent-simulate.png)

## Review Results

### 1. Check Execution

* View the conversation transcript
* Listen to the audio recording
* Review success/failure status

### 2. Analyze Performance

* Check if data was captured correctly
* Review any evaluation feedback
* Note areas for improvement

![review-executions](https://mintlify.s3.us-west-1.amazonaws.com/autoblocks/images/review-executions.png)

## Next Steps

1. **Iterate and Improve**
   * Add more personas
   * Include additional edge cases
   * Refine evaluation criteria

2. **Scale Testing**
   * Create variations of your scenario
   * Test different conversation flows
   * Add more complex edge cases

3. **Monitor and Optimize**
   * Track success rates
   * Review usage in Settings
   * Optimize based on results


# Scenarios
Source: https://docs.autoblocks.ai/v2/guides/agent-simulate/scenarios

Learn how to create and manage simulations in Agent Simulate

## Create a Simulation

### Understanding Scenarios

A scenario in Agent Simulate is a structured test case for your AI agent. The scenario name and description play a crucial role:

* **Scenario Name**: Acts as a clear identifier that helps the Autoblocks agent understand the type of test being performed (e.g., "Appointment Confirmation")
* **Scenario Description**: Provides specific context and goals for the test (e.g., "You should test calling a receptionist to confirm your appointment")

These fields help guide the agent's behavior and set expectations for the simulation.

### Scenario Configuration Steps

Creating a scenario involves four main configuration steps:

1. **Personas**: Define who will interact with your agent
2. **Data Fields**: Specify the information to be collected or verified
3. **Edge Cases**: Set up challenging scenarios
4. **Summary**: Review and launch the simulation

### Setting up Personas

Add and configure personas to test different caller types:

**Pre-built Personas include:**

* Elderly Caller
  * Speaks slowly
  * May have difficulty hearing
  * Occasionally repeats themselves

* Impatient Caller
  * In a hurry
  * Frequently interrupts
  * Expresses urgency throughout the call

* Distracted Caller
  * Has significant background noise
  * Seems to be multitasking during the call

You can add multiple personas to test how your agent handles different types of callers. Each persona will generate unique simulation scenarios with specific behavioral cues.

### Configuring Data Fields

Data fields define the information your agent needs to collect or verify:

1. **Field Definition**
   * Give each field a clear name
   * Add a description of what information to collect
   * Specify the expected format

2. **Variant Types**
   * Add different ways people might express the same information
   * Example for dates: "March 15, 2023", "3/15/23", "next Wednesday"
   * Each variant tests your agent's ability to understand different formats

3. **Field Properties**
   * Required vs optional fields
   * Validation rules
   * Dependencies between fields

### Setting up Edge Cases

Edge cases help test how your agent handles challenging scenarios:

**Common Edge Cases:**

* Communication Challenges
  * Frequently interrupts
  * Provides information out of order
  * Has difficulty hearing the agent
  * Speaks very quietly/loudly
  * Has strong accent

* Behavioral Scenarios
  * Hangs up abruptly
  * Is confused about details
  * Asks to repeat information
  * Has background noise
  * Gives inconsistent information

**Custom Edge Cases:**

* Create specific test scenarios
* Define unique behavioral patterns
* Add custom validation rules

### Running the Simulation

Once configured, you can:

1. Review all settings in the Summary view
2. Save the scenario for future use
3. Run the simulation immediately
4. Track progress in real-time
5. Access detailed results and analytics

### Version Control

* Track scenario changes over time
* Easily go back to previous versions

## Best Practices

### Scenario Design

1. **Clear Objectives**
   * Write specific scenario descriptions
   * Define measurable success criteria
   * Focus on one main test goal

2. **Comprehensive Testing**
   * Include both common and edge cases
   * Test various persona combinations
   * Cover different data input formats

3. **Maintainable Structure**
   * Use consistent naming conventions
   * Document special test conditions
   * Keep scenarios focused and modular

### Performance Optimization

* Start with basic scenarios
* Gradually add complexity
* Monitor agent response times
* Analyze success patterns
* Iterate based on results


# Overview
Source: https://docs.autoblocks.ai/v2/guides/datasets/overview

Reference for how Autoblocks allows you to manage test cases and datasets.

# Dataset & Test Case Management

Autoblocks enables you to manage test cases the way you would like to; in code, managed through our web application, or a hybrid approach with both. This gives flexibility to the developer about how and when they want to integrate with Autoblocks.

## Key Features

### Schema Management

* Version-controlled schema definitions
* Automatic versioning of schema changes
* Prevention of breaking changes
* Type safety and validation

### Dataset Organization

* Flexible test case management
* Dataset splits for subset creation
* Version control for datasets
* Integration with test suites

### Integration Options

* TypeScript and Python SDK support
* Web application management
* Hybrid code/UI approach
* CI/CD pipeline integration

## Core Concepts

### Schema Versioning

When creating a dataset for the first time, you will be prompted to build a schema. A schema is the list of properties that each item in the dataset will have. Autoblocks will automatically version the schema every time you update it and prevent you from making any breaking changes.

### Dataset Splits

Splits are a way to divide your dataset into smaller, more manageable pieces. This is useful for creating a subset of your dataset to use for testing different scenarios.

## Next Steps

* [TypeScript SDK Reference](/api-reference/datasets/list-datasets)
* [Python SDK Reference](/api-reference/datasets/list-datasets)


# Out of Box Evaluators
Source: https://docs.autoblocks.ai/v2/guides/evaluators/out-of-box

Learn about the built-in evaluators available in Autoblocks.

Autoblocks provides a set of evaluators that can be used out of the box. These evaluators are designed to be easily integrated into your test suite and can help you get started with testing your AI-powered applications.

Each evaluator below lists the custom properties and methods that need to be implemented to use the evaluator in your test suite.
You must set the `id` property, which is a unique identifier for the evaluator.

All of the code snippets can be run by following the instructions in the [Quick Start](/v2/guides/testing/overview.mdx) guide.

**Ragas**

* [LLM Context Precision With Reference](#out-of-box-evaluators-ragas)
* [Non LLM Context Precision With Reference](#out-of-box-evaluators-ragas)
* [LLM Context Recall](#out-of-box-evaluators-ragas)
* [Non LLM Context Recall](#out-of-box-evaluators-ragas)
* [Context Entities Recall](#out-of-box-evaluators-ragas)
* [Noise Sensitivity](#out-of-box-evaluators-ragas)
* [Response Relevancy](#out-of-box-evaluators-ragas)
* [Faithfulness](#out-of-box-evaluators-ragas)
* [Factual Correctness](#out-of-box-evaluators-ragas)
* [Semantic Similarity](#out-of-box-evaluators-semantic-similarity)

## Logic Based

### Is Equals

The `IsEquals` evaluator checks if the expected output equals the actual output.

Scores 1 if equal, 0 otherwise.

| Name               | Required | Type                            | Description                                    |
| ------------------ | -------- | ------------------------------- | ---------------------------------------------- |
| test\_case\_mapper | Yes      | Callable\[\[BaseTestCase], str] | Map your test case to a string for comparison. |
| output\_mapper     | Yes      | Callable\[\[OutputType], str]   | Map your output to a string for comparison.    |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseIsEquals
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str
      expected_output: str

      def hash(self) -> str:
          return md5(self.input)

  class IsEquals(BaseIsEquals[TestCase, str]):
      id = "is-equals"

      def test_case_mapper(self, test_case: TestCase) -> str:
          return test_case.expected_output

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="hello world",
              expected_output="hello world",
          ),
          TestCase(
              input="hi world",
              expected_output="hello world",
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[IsEquals()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseIsEquals } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
    expectedOutput: string;
  }

  class IsEquals extends BaseIsEquals<TestCase, string> {
    id = 'is-equals';

    outputMapper(args: { output: string }) {
      return args.output;
    }

    testCaseMapper(args: { testCase: TestCase }) {
      return args.testCase.expectedOutput;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'hello world',
        expectedOutput: 'hello world',
      },
      {
        input: 'hi world',
        expectedOutput: 'hello world',
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new IsEquals()],
  });
  ```
</CodeGroup>

### Is Valid JSON

The `IsValidJSON` evaluator checks if the output is valid JSON.

Scores 1 if it is valid, 0 otherwise.

| Name           | Required | Type                          | Description                                                         |
| -------------- | -------- | ----------------------------- | ------------------------------------------------------------------- |
| output\_mapper | Yes      | Callable\[\[OutputType], str] | Map your output to the string that you want to check is valid JSON. |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseIsValidJSON
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str

      def hash(self) -> str:
          return md5(self.input)

  class IsValidJSON(BaseIsValidJSON[TestCase, str]):
      id = "is-valid-json"

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="hello world",
          ),
          TestCase(
              input='{"hello": "world"}'
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[IsValidJSON()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseIsValidJSON } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
  }

  class IsValidJSON extends BaseIsValidJSON<TestCase, string> {
    id = 'is-valid-json';

    outputMapper(args: { output: string }) {
      return args.output;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'hello world',
      },
      {
        input: '{"hello": "world"}'
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new IsValidJSON()],
  });
  ```
</CodeGroup>

### Has All Substrings

The `HasAllSubstrings` evaluator checks if the output contains all the expected substrings.

Scores 1 if all substrings are present, 0 otherwise.

| Name               | Required | Type                                   | Description                                                         |
| ------------------ | -------- | -------------------------------------- | ------------------------------------------------------------------- |
| test\_case\_mapper | Yes      | Callable\[\[BaseTestCase], list\[str]] | Map your test case to a list of strings to check for in the output. |
| output\_mapper     | Yes      | Callable\[\[OutputType], str]          | Map your output to a string for comparison.                         |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseHasAllSubstrings
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str
      expected_substrings: list[str]

      def hash(self) -> str:
          return md5(self.input)

  class HasAllSubstrings(BaseHasAllSubstrings[TestCase, str]):
      id = "has-all-substrings"

      def test_case_mapper(self, test_case: TestCase) -> list[str]:
          return test_case.expected_substrings

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="hello world",
              expected_substrings=["hello", "world"],
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[HasAllSubstrings()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseHasAllSubstrings } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
    expectedSubstrings: string[];
  }

  class HasAllSubstrings extends BaseHasAllSubstrings<TestCase, string> {
    id = 'has-all-substrings';

    outputMapper(args: { output: string }) {
      return args.output;
    }

    testCaseMapper(args: { testCase: TestCase }) {
      return args.testCase.expectedSubstrings;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'hello world',
        expectedSubstrings: ['hello', 'world'],
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new HasAllSubstrings()],
  });
  ```
</CodeGroup>

### Assertions (Rubric/Rules)

The `Assertions` evaluator enables you to define a set of assertions or rules that your output must satisfy.

<Note>
  Individual assertions can be marked as not required, and if they are not met, the evaluator will still pass.
</Note>

| Name                 | Required | Type                                                         | Description                                      |
| -------------------- | -------- | ------------------------------------------------------------ | ------------------------------------------------ |
| evaluate\_assertions | Yes      | Callable\[\[BaseTestCase, Any], Optional\[List\[Assertion]]] | Implement your logic to evaluate the assertions. |

<CodeGroup>
  ```python Python
  from typing import Optional
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseAssertions
  from autoblocks.testing.models import Assertion
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCaseCriterion:
      criterion: str
      required: bool


  @dataclass
  class TestCase(BaseTestCase):
      input: str
      assertions: Optional[list[TestCaseCriterion]] = None

      def hash(self) -> str:
          return md5(self.input)


  class AssertionsEvaluator(BaseAssertions[TestCase, str]):
      id = "assertions"

      def evaluate_assertions(self, test_case: TestCase, output: str) -> list[Assertion]:
          if test_case.assertions is None:
              return []
          result = []
          for assertion in test_case.assertions:
              result.append(
                  Assertion(
                      criterion=assertion.criterion,
                      passed=assertion.criterion in output,
                      required=assertion.required,
                  )
              )
          return result


  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="hello world",
              assertions=[
                  TestCaseCriterion(criterion="hello", required=True),
                  TestCaseCriterion(criterion="world", required=True),
                  TestCaseCriterion(criterion="hi", required=False),
              ],
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[AssertionsEvaluator()],
  )
  ```

  ```typescript TypeScript
  import {
    runTestSuite,
    BaseAssertions,
    Assertion,
  } from '@autoblocks/client/testing';

  interface TestCaseCriterion {
    criterion: string;
    required: boolean;
  }

  interface TestCase {
    input: string;
    assertions: TestCaseCriterion[];
  }

  class AssertionsEvaluator extends BaseAssertions<TestCase, string> {
    id = 'assertions';

    evaluateAssertions(args: {
      testCase: TestCase;
      output: string;
    }): Assertion[] {
      return args.testCase.assertions.map((assertion) => {
        return {
          criterion: assertion.criterion,
          passed: args.output.includes(assertion.criterion),
          required: assertion.required,
        };
      });
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'hello world',
        assertions: [
          {
            criterion: 'hello',
            required: true,
          },
          {
            criterion: 'world',
            required: true,
          },
          {
            criterion: 'hi',
            required: false,
          },
        ],
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new AssertionsEvaluator()],
  });
  ```
</CodeGroup>

## LLM Judges

### Custom LLM Judge

The `CustomLLMJudge` evaluator enables you to define custom evaluation criteria using an LLM judge.

| Name                    | Required | Type                                  | Description                                                              |
| ----------------------- | -------- | ------------------------------------- | ------------------------------------------------------------------------ |
| output\_mapper          | Yes      | Callable\[\[OutputType], str]         | Map your output to the string that you want to evaluate.                 |
| model                   | No       | str                                   | The OpenAI model to use. Defaults to "gpt-4o".                           |
| num\_overrides          | No       | int                                   | Number of recent evaluation overrides to use as examples. Defaults to 0. |
| example\_output\_mapper | No       | Callable\[\[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output.      |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseCustomLLMJudge
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str

      def hash(self) -> str:
          return md5(self.input)

  class CustomLLMJudge(BaseCustomLLMJudge[TestCase, str]):
      id = "custom-llm-judge"

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="Hello, how are you?",
          ),
          TestCase(
              input="I hate you!"
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[CustomLLMJudge()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseCustomLLMJudge } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
  }

  class CustomLLMJudge extends BaseCustomLLMJudge<TestCase, string> {
    id = 'custom-llm-judge';

    outputMapper(args: { output: string }) {
      return args.output;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'Hello, how are you?',
      },
      {
        input: 'I hate you!'
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new CustomLLMJudge()],
  });
  ```
</CodeGroup>

### Automatic Battle

The `AutomaticBattle` evaluator enables you to compare two outputs using an LLM judge.

| Name                    | Required | Type                                  | Description                                                              |
| ----------------------- | -------- | ------------------------------------- | ------------------------------------------------------------------------ |
| output\_mapper          | Yes      | Callable\[\[OutputType], str]         | Map your output to the string that you want to compare.                  |
| model                   | No       | str                                   | The OpenAI model to use. Defaults to "gpt-4o".                           |
| num\_overrides          | No       | int                                   | Number of recent evaluation overrides to use as examples. Defaults to 0. |
| example\_output\_mapper | No       | Callable\[\[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output.      |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseAutomaticBattle
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str

      def hash(self) -> str:
          return md5(self.input)

  class AutomaticBattle(BaseAutomaticBattle[TestCase, str]):
      id = "automatic-battle"

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="Hello, how are you?",
          ),
          TestCase(
              input="I hate you!"
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[AutomaticBattle()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseAutomaticBattle } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
  }

  class AutomaticBattle extends BaseAutomaticBattle<TestCase, string> {
    id = 'automatic-battle';

    outputMapper(args: { output: string }) {
      return args.output;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'Hello, how are you?',
      },
      {
        input: 'I hate you!'
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new AutomaticBattle()],
  });
  ```
</CodeGroup>

### Manual Battle

The `ManualBattle` evaluator enables you to compare two outputs using human evaluation.

| Name                    | Required | Type                                  | Description                                                              |
| ----------------------- | -------- | ------------------------------------- | ------------------------------------------------------------------------ |
| output\_mapper          | Yes      | Callable\[\[OutputType], str]         | Map your output to the string that you want to compare.                  |
| num\_overrides          | No       | int                                   | Number of recent evaluation overrides to use as examples. Defaults to 0. |
| example\_output\_mapper | No       | Callable\[\[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output.      |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseManualBattle
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str

      def hash(self) -> str:
          return md5(self.input)

  class ManualBattle(BaseManualBattle[TestCase, str]):
      id = "manual-battle"

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="Hello, how are you?",
          ),
          TestCase(
              input="I hate you!"
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[ManualBattle()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseManualBattle } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
  }

  class ManualBattle extends BaseManualBattle<TestCase, string> {
    id = 'manual-battle';

    outputMapper(args: { output: string }) {
      return args.output;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'Hello, how are you?',
      },
      {
        input: 'I hate you!'
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new ManualBattle()],
  });
  ```
</CodeGroup>

### Accuracy

The `Accuracy` evaluator checks if the output is accurate compared to an expected output.

Scores 1 if accurate, 0.5 if somewhat accurate, 0 if inaccurate.

| Name                    | Required | Type                                  | Description                                                              |
| ----------------------- | -------- | ------------------------------------- | ------------------------------------------------------------------------ |
| output\_mapper          | Yes      | Callable\[\[OutputType], str]         | Map your output to the string that you want to check for accuracy.       |
| model                   | No       | str                                   | The OpenAI model to use. Defaults to "gpt-4o".                           |
| num\_overrides          | No       | int                                   | Number of recent evaluation overrides to use as examples. Defaults to 0. |
| example\_output\_mapper | No       | Callable\[\[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output.      |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseAccuracy
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str
      expected_output: str

      def hash(self) -> str:
          return md5(self.input)

  class Accuracy(BaseAccuracy[TestCase, str]):
      id = "accuracy"

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="What is the capital of France?",
              expected_output="The capital of France is Paris.",
          ),
          TestCase(
              input="What is the capital of France?",
              expected_output="Paris is the capital of France.",
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[Accuracy()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseAccuracy } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
    expectedOutput: string;
  }

  class Accuracy extends BaseAccuracy<TestCase, string> {
    id = 'accuracy';

    outputMapper(args: { output: string }) {
      return args.output;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'What is the capital of France?',
        expectedOutput: 'The capital of France is Paris.',
      },
      {
        input: 'What is the capital of France?',
        expectedOutput: 'Paris is the capital of France.',
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new Accuracy()],
  });
  ```
</CodeGroup>

### NSFW

The `NSFW` evaluator checks if the output is safe for work.

Scores 1 if safe, 0 otherwise.

| Name                    | Required | Type                                  | Description                                                              |
| ----------------------- | -------- | ------------------------------------- | ------------------------------------------------------------------------ |
| output\_mapper          | Yes      | Callable\[\[OutputType], str]         | Map your output to the string that you want to check for NSFW content.   |
| model                   | No       | str                                   | The OpenAI model to use. Defaults to "gpt-4o".                           |
| num\_overrides          | No       | int                                   | Number of recent evaluation overrides to use as examples. Defaults to 0. |
| example\_output\_mapper | No       | Callable\[\[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output.      |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseNSFW
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str

      def hash(self) -> str:
          return md5(self.input)

  class NSFW(BaseNSFW[TestCase, str]):
      id = "nsfw"

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="Hello, how are you?",
          ),
          TestCase(
              input="Explicit content here"
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[NSFW()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseNSFW } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
  }

  class NSFW extends BaseNSFW<TestCase, string> {
    id = 'nsfw';

    outputMapper(args: { output: string }) {
      return args.output;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'Hello, how are you?',
      },
      {
        input: 'Explicit content here'
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new NSFW()],
  });
  ```
</CodeGroup>

### Toxicity

The `Toxicity` evaluator checks if the output is not toxic.

Scores 1 if it is not toxic, 0 otherwise.

| Name                    | Required | Type                                  | Description                                                              |
| ----------------------- | -------- | ------------------------------------- | ------------------------------------------------------------------------ |
| output\_mapper          | Yes      | Callable\[\[OutputType], str]         | Map your output to the string that you want to check for toxicity.       |
| model                   | No       | str                                   | The OpenAI model to use. Defaults to "gpt-4o".                           |
| num\_overrides          | No       | int                                   | Number of recent evaluation overrides to use as examples. Defaults to 0. |
| example\_output\_mapper | No       | Callable\[\[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output.      |

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseToxicity
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str

      def hash(self) -> str:
          return md5(self.input)

  class Toxicity(BaseToxicity[TestCase, str]):
      id = "toxicity"

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="Hello, how are you?",
          ),
          TestCase(
              input="I hate you!"
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[Toxicity()],
  )
  ```

  ```typescript TypeScript
  import { runTestSuite, BaseToxicity } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
  }

  class Toxicity extends BaseToxicity<TestCase, string> {
    id = 'toxicity';

    outputMapper(args: { output: string }) {
      return args.output;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'Hello, how are you?',
      },
      {
        input: 'I hate you!'
      }
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [new Toxicity()],
  });
  ```
</CodeGroup>

## Ragas

[Ragas](https://docs.ragas.io/en/stable/) is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines.
We have built wrappers around the metrics to make integration with Autoblocks seamless.

**Available Ragas evaluators:**

* [`BaseRagasLLMContextPrecisionWithReference`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision) uses a LLM to measure the proportion of relevant chunks in the retrieved\_contexts.
* [`BaseRagasNonLLMContextPrecisionWithReference`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_entities_recall) measures the proportion of relevant chunks in the retrieved\_contexts without using a LLM.
* [`BaseRagasLLMContextRecall`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/#llm-based-context-recall) evaluates the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
* [`BaseRagasNonLLMContextRecall`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall) uses non llm string comparison metrics to identify if a retrieved context is relevant or not.
* [`BaseRagasContextEntitiesRecall`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_entities_recall)  evaluates the measure of recall of the retrieved context, based on the number of entities present in both ground\_truths and contexts relative to the number of entities present in the ground\_truths alone.
* [`BaseRagasNoiseSensitivity`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/noise_sensitivity) measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.
* [`BaseRagasResponseRelevancy`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance) focuses on assessing how pertinent the generated answer is to the given prompt.
* [`BaseRagasFaithfulness`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness) measures the factual consistency of the generated answer against the given context.
* [`BaseRagasFactualCorrectness`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness) compares and evaluates the factual accuracy of the generated response with the reference. This metric is used to determine the extent to which the generated response aligns with the reference.
* [`BaseRagasSemanticSimilarity`](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/semantic_similarity) measures the semantic resemblance between the generated answer and the ground truth.

<Note>
  The Ragas evaluators are only available in the Python SDK. You must install Ragas (`pip install ragas`) before using these evaluators.
  Our wrappers require at least version `0.2.*` of Ragas.
</Note>

| Name                        | Required | Type                                        | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| --------------------------- | -------- | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| id                          | Yes      | str                                         | The unique identifier for the evaluator.                                                                                                                                                                                                                                                                                                                                                                                                                     |
| threshold                   | No       | Threshold                                   | The [threshold](/testing/sdk-reference#threshold) for the evaluation used to determine pass/fail.                                                                                                                                                                                                                                                                                                                                                            |
| llm                         | No       | Any                                         | Custom LLM for the evaluation. Required for any Ragas evaluator that uses a LLM. Read More: [https://docs.ragas.io/en/stable/howtos/customizations/customize\_models/](https://docs.ragas.io/en/stable/howtos/customizations/customize_models/)                                                                                                                                                                                                              |
| embeddings                  | No       | Any                                         | Custom embeddings model for the evaluation. Required for any Ragas evaluator that uses embeddings. Read More: [https://docs.ragas.io/en/stable/howtos/customizations/customize\_models/](https://docs.ragas.io/en/stable/howtos/customizations/customize_models/)                                                                                                                                                                                            |
| mode                        | No       | str                                         | Only applicable for the `BaseRagasFactualCorrectness` evaluator. Read More: [https://docs.ragas.io/en/stable/concepts/metrics/available\_metrics/factual\_correctness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness)                                                                                                                                                                                               |
| atomicity                   | No       | str                                         | Only applicable for the `BaseRagasFactualCorrectness` evaluator. Read More: [https://docs.ragas.io/en/stable/concepts/metrics/available\_metrics/factual\_correctness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness)                                                                                                                                                                                               |
| focus                       | No       | str                                         | Only applicable for the `BaseRagasNoiseSensitivity` and `BaseRagasFaithfulness` evaluator. Read More: [https://docs.ragas.io/en/stable/concepts/metrics/available\_metrics/noise\_sensitivity](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/noise_sensitivity) and [https://docs.ragas.io/en/stable/concepts/metrics/available\_metrics/faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness) |
| user\_input\_mapper         | No       | Callable\[\[TestCaseType, OutputType], str] | Map your test case or output to the user input passed to Ragas.                                                                                                                                                                                                                                                                                                                                                                                              |
| response\_mapper            | No       | Callable\[OutputType], str]                 | Map your output to the response passed to Ragas.                                                                                                                                                                                                                                                                                                                                                                                                             |
| reference\_mapper           | No       | Callable\[\[TestCaseType], str]             | Map your test case to the reference passed to Ragas.                                                                                                                                                                                                                                                                                                                                                                                                         |
| retrieved\_contexts\_mapper | No       | Callable\[\[TestCaseType, OutputType], str] | Map your test case and output to the retrieved contexts passed to Ragas.                                                                                                                                                                                                                                                                                                                                                                                     |
| reference\_contexts\_mapper | No       | Callable\[\[TestCaseType], str]             | Map your test case to the reference contexts passed to Ragas.                                                                                                                                                                                                                                                                                                                                                                                                |

<Note>
  Individual Ragas evaluators require different parameters.
  You can find sample implementations for each of the Ragas evaluators [here](https://github.com/autoblocksai/python-sdk/blob/main/tests/autoblocks/test_ragas_evaluators.py).
</Note>

<CodeGroup>
  ```python Python
  from dataclasses import dataclass

  from langchain_openai import ChatOpenAI
  from langchain_openai import OpenAIEmbeddings
  from ragas.embeddings import LangchainEmbeddingsWrapper  # type: ignore[import-untyped]
  from ragas.llms import LangchainLLMWrapper  # type: ignore[import-untyped]

  from autoblocks.testing.evaluators import BaseRagasResponseRelevancy
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.models import Threshold
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      question: str
      expected_answer: str

      def hash(self) -> str:
          return md5(self.question)
      
  @dataclass
  class Output:
      answer: str
      contexts: list[str]

  # You can use any of the Ragas evaluators listed here:
  # https://docs.autoblocks.ai/testing/offline-evaluations#out-of-box-evaluators-ragas
  class ResponseRelevancy(BaseRagasResponseRelevancy[TestCase, Output]):
      id = "response-relevancy"
      threshold = Threshold(gte=1)
      llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
      embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

      def user_input_mapper(self, test_case: TestCase, output: Output) -> str:
          return test_case.question

      def response_mapper(self, output: Output) -> str:
          return output.answer

      def retrieved_contexts_mapper(self, test_case: TestCase, output: Output) -> list[str]:
          return output.contexts

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              question="How tall is the Eiffel Tower?",
              expected_answer="300 meters"
          )
      ],
      fn=lambda test_case: Output(
          answer="300 meters",
          contexts=["The Eiffel tower stands 300 meters tall."],
      ),
      evaluators=[ResponseRelevancy()],
  )
  ```
</CodeGroup>


# Overview
Source: https://docs.autoblocks.ai/v2/guides/evaluators/overview

Learn about Autoblocks Evaluators and how they help you assess your AI applications.

# Evaluators Overview

Autoblocks Evaluators provide a comprehensive system for assessing the quality and performance of your AI applications. They enable you to define custom evaluation criteria and integrate them seamlessly into your development workflow.

## Key Features

### Flexible Evaluation Types

* Rule-based evaluators for simple checks
* LLM-based evaluators for complex assessments
* Webhook evaluators for custom logic
* Out-of-box evaluators for common use cases

### Integration Options

* TypeScript and Python SDK support
* UI-based evaluator creation
* CLI integration
* CI/CD pipeline support

### Rich Evaluation Capabilities

* Custom scoring logic
* Threshold-based pass/fail
* Detailed evaluation metadata
* Evaluation history tracking

## Getting Started

Choose your preferred language to begin:

<CodeGroup>
  <CodeGroupItem title="TypeScript">
    ```typescript
    import { BaseTestEvaluator, Evaluation } from '@autoblocks/client/testing';

    class HasSubstring extends BaseTestEvaluator<MyTestCase, string> {
      id = 'has-substring';

      evaluateTestCase(args: { testCase: MyTestCase; output: string }): Evaluation {
        const score = args.output.includes(args.testCase.expectedSubstring) ? 1 : 0;
        return {
          score,
          threshold: { gte: 1 },
        };
      }
    }
    ```
  </CodeGroupItem>

  <CodeGroupItem title="Python">
    ```python
    from autoblocks.testing.models import BaseTestEvaluator
    from autoblocks.testing.models import Evaluation
    from autoblocks.testing.models import Threshold

    class HasSubstring(BaseTestEvaluator):
      id = "has-substring"

      def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
          score = 1 if test_case.expected_substring in output else 0
          return Evaluation(
              score=score,
              threshold=Threshold(gte=1),
          )
    ```
  </CodeGroupItem>
</CodeGroup>

## Core Concepts

### Evaluator Types

Different approaches to evaluation:

* Rule-based evaluators for simple checks
* LLM judges for complex assessments
* Webhook evaluators for custom logic
* Out-of-box evaluators for common use cases

### Evaluation Components

Key elements of an evaluator:

* Unique identifier
* Scoring logic
* Threshold configuration
* Metadata and documentation

### Integration Methods

Ways to use evaluators:

* SDK integration
* UI-based creation
* CLI execution
* CI/CD pipeline

## Next Steps

* [TypeScript Quick Start](/v2/guides/evaluators/typescript/quick-start)
* [Python Quick Start](/v2/guides/evaluators/python/quick-start)
* [Out of Box Evaluators](/v2/guides/evaluators/out-of-box)


# Python Quick Start
Source: https://docs.autoblocks.ai/v2/guides/evaluators/python/quick-start

Get started with Autoblocks Evaluators in Python.

# Python Quick Start

This guide will help you get started with creating and using evaluators in Python.

## Installation

First, install the Autoblocks client:

```bash
pip install autoblocks-client
```

## Creating an Evaluator

Let's create a simple evaluator that checks if a response contains a specific substring:

```python
from typing import Dict, Any
from autoblocks.testing import BaseTestEvaluator, Evaluation

class HasSubstring(BaseTestEvaluator):
    id = "has-substring"

    def evaluate_test_case(self, test_case: Dict[str, Any], output: str) -> Evaluation:
        score = 1 if test_case["expected_substring"] in output else 0
        return Evaluation(score=score, threshold={"gte": 1})
```

## Using Out of Box Evaluators

Autoblocks provides several [out-of-box evaluators](/v2/guides/evaluators/out-of-box) that you can use directly:

```python
from typing import Dict, Any
from autoblocks.testing import BaseAccuracy

class Accuracy(BaseAccuracy):
    id = "accuracy"

    def output_mapper(self, output: str) -> str:
        return output

    def expected_output_mapper(self, test_case: Dict[str, Any]) -> str:
        return test_case["expected_output"]
```

## Running Evaluations

You can run evaluations using the test suite:

```python
from typing import Dict, Any
from autoblocks.testing import run_test_suite

async def main():
    await run_test_suite(
        id="my-test-suite",
        test_cases=[
            {
                "input": "hello world",
                "expected_output": "hello world",
            }
        ],
        test_case_hash=["input"],
        fn=lambda test_case: test_case["input"],
        evaluators=[Accuracy()],
    )

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```

## Next Steps

* [Out of Box Evaluators](/v2/guides/evaluators/out-of-box)
* [Testing SDK](/v2/guides/testing/python/quick-start)


# TypeScript Quick Start
Source: https://docs.autoblocks.ai/v2/guides/evaluators/typescript/quick-start

Get started with Autoblocks Evaluators in TypeScript.

# TypeScript Quick Start

This guide will help you get started with creating and using evaluators in TypeScript.

## Installation

First, install the Autoblocks client:

```bash
npm install @autoblocks/client
```

## Creating an Evaluator

Let's create a simple evaluator that checks if a response contains a specific substring:

```typescript
import { BaseTestEvaluator, Evaluation } from '@autoblocks/client/testing';

interface MyTestCase {
  input: string;
  expectedSubstring: string;
}

class HasSubstring extends BaseTestEvaluator<MyTestCase, string> {
  id = 'has-substring';

  evaluateTestCase(args: { testCase: MyTestCase; output: string }): Evaluation {
    const score = args.output.includes(args.testCase.expectedSubstring) ? 1 : 0;
    return {
      score,
      threshold: { gte: 1 },
    };
  }
}
```

## Using Out of Box Evaluators

Autoblocks provides several [out-of-box evaluators](/v2/guides/evaluators/out-of-box) that you can use directly:

```typescript
import { BaseAccuracy } from '@autoblocks/client/testing';

class Accuracy extends BaseAccuracy<MyTestCase, string> {
  id = 'accuracy';

  outputMapper(args: { output: string }): string {
    return args.output;
  }

  expectedOutputMapper(args: { testCase: MyTestCase }): string {
    return args.testCase.expectedOutput;
  }
}
```

## Running Evaluations

You can run evaluations using the test suite:

```typescript
import { runTestSuite } from '@autoblocks/client/testing';

runTestSuite<MyTestCase, string>({
  id: 'my-test-suite',
  testCases: [
    {
      input: 'hello world',
      expectedOutput: 'hello world',
    },
  ],
  testCaseHash: ['input'],
  fn: ({ testCase }) => testCase.input,
  evaluators: [new Accuracy()],
});
```

## Next Steps

* [Out of Box Evaluators](/v2/evaluators/out-of-box)
* [Testing SDK](/v2/guides/testing/typescript/quick-start)


# Overview
Source: https://docs.autoblocks.ai/v2/guides/human-review/overview

Human review mode is designed for humans to review, grade, and discuss test results..

## Formatting your test cases and outputs

The schemas of your test case and output as they exist in your codebase often contain implementation details that are not relevant to a human reviewer.
Each SDK provides methods that allow you to transform your test cases and outputs into human-readable formats.

<CodeGroup>
  ```python python
  from dataclasses import dataclass
  from uuid import UUID
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.models import HumanReviewField
  from autoblocks.testing.models import HumanReviewFieldContentType
  from autoblocks.testing.util import md5

  @dataclass
  class Document:
      uuid: UUID  # Not relevant for human review, so we don't include it below
      title: str
      content: str

  @dataclass
  class MyCustomTestCase(BaseTestCase):
      user_question: str
      documents: list[Document]

      def hash(self) -> str:
          return md5(self.user_question)

      def serialize_for_human_review(self) -> list[HumanReviewField]:
          return [
              HumanReviewField(
                  name="Question",
                  value=self.user_question,
                  content_type=HumanReviewFieldContentType.TEXT,
              ),
          ] + [
              HumanReviewField(
                  name=f"Document {i + 1}: {doc.title}",
                  value=doc.content,
                  content_type=HumanReviewFieldContentType.TEXT,
              )
              for i, doc in enumerate(self.documents)
          ]

  @dataclass
  class MyCustomOutput:
      answer: str
      reason: str

      # These fields are implementation details not needed
      # for human review, so they will be omitted below
      x: int
      y: int
      z: int

      def serialize_for_human_review(self) -> list[HumanReviewField]:
          return [
              HumanReviewField(
                  name="Answer",
                  value=self.answer,
                  content_type=HumanReviewFieldContentType.TEXT
              ),
              HumanReviewField(
                  name="Reason",
                  value=self.reason,
                  content_type=HumanReviewFieldContentType.TEXT,
              ),
          ]
  ```

  ```typescript typescript
  import { runTestSuite, HumanReviewFieldContentType } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'hello world',
      },
    ],
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input,
    evaluators: [],
    serializeTestCaseForHumanReview: (testCase) => [
      {
        name: 'Input',
        value: testCase.input,
        contentType: HumanReviewFieldContentType.TEXT,
      },
    ],
    serializeOutputForHumanReview: (output) => [
      {
        name: 'Output',
        value: output,
        contentType: HumanReviewFieldContentType.TEXT,
      },
    ],
  });
  ```
</CodeGroup>

<Note>
  There are four different content types you can use to control the rendering in the Autoblocks UI:

  * TEXT
  * HTML
  * MARKDOWN
  * LINK
</Note>

<Note>
  This is often a good starting point when setting up a test suite for the first time.
  Developers can run the test without any code-based evaluators and review the results manually
  to understand the responses being generated by the LLM.
</Note>

## Creating a human review job programmatically

Whether you are on the free plan or a paid plan, you can create human review jobs directly in code with either the `RunManager` or in `runTestSuite`.

### run\_test\_suite / runTestSuite

<CodeGroup>
  ```python python
  from dataclasses import dataclass

  from autoblocks.testing.evaluators import BaseHasAllSubstrings
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.models import CreateHumanReviewJob
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class TestCase(BaseTestCase):
      input: str
      expected_substrings: list[str]

      def hash(self) -> str:
          return md5(self.input) # Unique identifier for a test case

  class HasAllSubstrings(BaseHasAllSubstrings[TestCase, str]):
      id = "has-all-substrings"

      def test_case_mapper(self, test_case: TestCase) -> list[str]:
          return test_case.expected_substrings

      def output_mapper(self, output: str) -> str:
          return output

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          TestCase(
              input="hello world",
              expected_substrings=["hello", "world"],
          )
      ], # Replace with your test cases
      fn=lambda test_case: test_case.input, # Replace with your LLM call
      evaluators=[HasAllSubstrings()], # Replace with your evaluators
      human_review_job=CreateHumanReviewJob(
          assignee_email_address="example@example.com",
          name="Review for accuracy",
      )
  )
  ```

  ```typescript typescript
  import { runTestSuite, BaseHasAllSubstrings } from '@autoblocks/client/testing';

  interface TestCase {
    input: string;
    expectedSubstrings: string[];
  }

  class HasAllSubstrings extends BaseHasAllSubstrings<TestCase, string> {
    id = 'has-all-substrings';

    outputMapper(args: { output: string }) {
      return args.output;
    }

    testCaseMapper(args: { testCase: TestCase }) {
      return args.testCase.expectedSubstrings;
    }
  }

  runTestSuite<TestCase, string>({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'hello world',
        expectedSubstrings: ['hello', 'world'],
      },
    ], // Replace with your test cases
    testCaseHash: ['input'],
    fn: ({ testCase }) => testCase.input, // Replace with your LLM call
    evaluators: [new HasAllSubstrings()], // Replace with your evaluators
    humanReviewJob: {
      assigneeEmailAddress: 'example@example.com',
      name: 'Review for accuracy',
    },
  });
  ```
</CodeGroup>

### Run Manager

<CodeGroup>
  ```python python
  from dataclasses import dataclass

  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.models import HumanReviewField
  from autoblocks.testing.models import HumanReviewFieldContentType
  from autoblocks.testing.run import RunManager
  from autoblocks.testing.util import md5


  # Update with your test case type
  @dataclass
  class TestCase(BaseTestCase):
      input: str

      def serialize_for_human_review(self) -> list[HumanReviewField]:
          return [
              HumanReviewField(
                  name="Input",
                  value=self.input,
                  content_type=HumanReviewFieldContentType.TEXT,
              ),
          ]

      def hash(self) -> str:
          return md5(self.input)


  # Update with your output type
  @dataclass
  class Output:
      output: str

      def serialize_for_human_review(self) -> list[HumanReviewField]:
          return [
              HumanReviewField(
                  name="Output",
                  value=self.output,
                  content_type=HumanReviewFieldContentType.TEXT,
              ),
          ]


  run = RunManager[TestCase, Output](
      test_id="test-id",
  )

  run.start()
  # Add results from your test suite here
  run.add_result(
      test_case=TestCase(input="Hello, world!"),
      output=Output(output="Hi, world!"),
  )
  run.end()

  run.create_human_review_job(
      assignee_email_address="${emailAddress}",
      name="Review for accuracy",
  )
  ```

  ```typescript typescript
  import {
    HumanReviewFieldContentType,
    RunManager,
  } from '@autoblocks/client/testing';

  // Update with your test case and output type
  interface TestCase {
    input: string;
  }

  interface Output {
    output: string;
  }

  const main = async () => {
    const runManager = new RunManager<TestCase, Output>({
      testId: 'test-id',
      testCaseHash: ['input'],
      serializeTestCaseForHumanReview: (testCase) => [
        {
          type: HumanReviewFieldContentType.TEXT,
          value: testCase.input,
          name: 'input',
        },
      ],
      serializeOutputForHumanReview: ({ output }) => [
        { type: HumanReviewFieldContentType.TEXT, value: output, name: 'output' },
      ],
    });

    await runManager.start();
    // Add results from your test suite here
    await runManager.addResult({
      testCase: { input: 'Hello, world!' },
      output: { output: 'Hi, world!' },
      evaluations: [], // Add any automated evaluations
    });
    await runManager.end();

    // Create a human review job for the test run
    await runManager.createHumanReviewJob({
      assigneeEmailAddress: 'example@example.com',
      name: 'Review for accuracy',
    });
  };

  main();
  ```
</CodeGroup>

## Using the results

You can use the results of a human review job for a variety of purposes, such as:

* Fine tuning an evaluation model
* Few shot examples in your LLM judges
* Improving your core product based on expert feedback
* and more!


# Core Concepts
Source: https://docs.autoblocks.ai/v2/guides/introduction/core-concepts

Learn about the fundamental concepts that power Autoblocks, including apps, test cases, evaluations, human review, and agent simulation.

# Core Concepts

Understanding these core concepts will help you leverage Autoblocks effectively.

## Apps

Apps are the primary organizational units in Autoblocks, grouping related resources such as prompts, test cases, and evaluations. Apps typically align with specific use cases or business objectives, enabling clear organization, access control, usage tracking, and consistent evaluation standards.

## Test Cases & Datasets

Test cases and datasets evaluate AI models against realistic scenarios:

* **Dynamic Generation**: Automatically generate test cases from real user inputs.
* **Manual and Programmatic Creation**: Create test cases manually or via SDK.
* **Versioning and Collaboration**: Track changes and collaborate effectively.

## Evaluations

Evaluations measure AI performance comprehensively:

* **Automated Checks**: Programmatic validation of outputs.
* **SME-Aligned Metrics**: Incorporate expert feedback directly into evaluation logic.
* **Continuous Improvement**: Iterative enhancements based on evaluation outcomes.

## Human Review

Integrate domain expertise seamlessly:

* **Structured Workflows**: Efficiently gather and apply expert feedback.
* **Quality Assurance**: Ensure outputs meet high standards before deployment.

## Agent Simulation

Realistically test AI systems:

* **Scenario and Environment Simulation**: Create realistic testing scenarios.
* **Edge Case Identification**: Proactively discover and address potential issues.

## Workflow Builder

Efficiently manage complex testing processes:

* **Visual Interface**: Easily create and manage workflows.
* **Flexible Execution**: Run workflows on Autoblocks infrastructure or your own.

## Prompt Management

Optimize AI reliability through effective prompt management:

* **Version Control and A/B Testing**: Continuously refine prompts.
* **Performance Monitoring**: Track and enhance prompt effectiveness.

## Tracing

Gain deep insights into AI behavior:

* **Comprehensive Tracking**: Monitor interactions end-to-end.
* **Error Detection and Analytics**: Quickly identify and resolve issues.

## Integration

Seamlessly integrate Autoblocks into your existing workflows:

* **SDKs and APIs**: Easy integration with Python, TypeScript, and REST APIs.
* **CI/CD and Monitoring**: Integrate with your existing development and observability tools.


# What is Autoblocks?
Source: https://docs.autoblocks.ai/v2/guides/introduction/what-is-autoblocks

Ship AI apps you can trust. Autoblocks helps AI product teams prototype, test, and launch reliable apps & agents faster and at scale.

## Key Features

Autoblocks empowers AI product teams to prototype, test, and launch reliable AI applications and agents quickly and confidently. Designed for high-stakes industries, Autoblocks ensures your AI models are robust, compliant, and aligned with real-world business outcomes.

### Dynamic Test Case Generation

Automatically generate test cases from real user inputs, efficiently capturing critical edge cases and scenarios.

### SME-Aligned Evaluation Metrics

Integrate subject matter expert (SME) feedback directly into your evaluation pipeline, ensuring AI behavior aligns with real-world standards.

### Continuous Improvement Loop

Close the feedback loop between testing, SME insights, and production data, enabling continuous improvement of your AI agents.

### Red-Teaming & Simulation Tooling

Rapidly simulate thousands of real-world interactions to proactively identify and address potential risks and edge cases.

### Compliance and Security

Maintain enterprise-level security and compliance with industry standards such as HIPAA and SOC 2 Type 2, safeguarding sensitive data.

### Seamless Integration

Easily integrate Autoblocks into your existing technology stack without disruption, enhancing your current workflows and infrastructure.

## How Autoblocks Works

1. **Connect**: Integrate your existing AI agents, models, prompts, and evaluation logic.
2. **Test**: Define, import, or automatically generate test cases using real-world data.
3. **Align SMEs**: Engage SMEs to review outputs and provide structured feedback.
4. **Review & Deploy**: Analyze insights from comprehensive dashboards, iterate efficiently, and deploy optimal solutions.
5. **Monitor & Iterate**: Continuously monitor performance, automatically update test sets and evaluation metrics, and iteratively enhance your AI agents.

## Why Choose Autoblocks?

* **Accelerate Deployment**: Move beyond manual QA and brittle test scripts to rapidly deploy reliable AI.
* **Minimize Risk**: Proactively identify and mitigate risks before deployment.
* **Enhance Quality**: Leverage expert insights to continuously improve AI performance.
* **Scale Efficiently**: Automate testing and evaluation processes at scale.
* **Ensure Compliance**: Consistently meet industry regulations and standards.


# Overview
Source: https://docs.autoblocks.ai/v2/guides/prompt-management/overview

Learn about Autoblocks Prompt Management and how it helps you manage and version your prompts with type safety and autocomplete.

## Key Features

Autoblocks Prompt Management provides a robust system for managing, versioning, and executing your prompts with full type safety and autocomplete support. It enables you to maintain a single source of truth for your prompts while supporting multiple environments and deployment strategies.

### Type-Safe Prompt Management

* Autogenerated prompt classes with full type safety
* IDE autocomplete for templates and parameters
* Runtime validation of prompt inputs
* Support for both TypeScript and Python

### Version Control

* Semantic versioning for prompts (major.minor)
* Support for latest version tracking
* Undeployed revision support for local development
* Automatic background refresh of latest versions

### Integration Flexibility

* TypeScript and Python SDK support
* CLI for prompt generation
* CI/CD integration
* Local development support

## Getting Started

Choose your preferred language to begin:

<CodeGroup>
  <CodeGroupItem title="TypeScript">
    ```typescript
    import { AutoblocksPromptManager } from '@autoblocks/client/prompts';

    // Initialize the prompt manager
    const mgr = new AutoblocksPromptManager({
      appName: 'my-app',
      id: 'text-summarization',
      version: {
        major: '1',
        minor: 'latest',
      },
    });

    // Execute a prompt with type safety
    const response = await mgr.exec(async ({ prompt }) => {
      const params = {
        model: prompt.params.model,
        messages: [
          {
            role: 'system',
            content: prompt.renderTemplate({
              template: 'system',
              params: {
                language: 'Spanish',
              },
            }),
          },
        ],
      };
      return await openai.chat.completions.create(params);
    });
    ```
  </CodeGroupItem>

  <CodeGroupItem title="Python">
    ```python
    from my_project.autoblocks_prompts import my_app

    # Initialize the prompt manager
    mgr = my_app.text_summarization_prompt_manager(
      major_version="1",
      minor_version="latest",
    )

    # Execute a prompt with type safety
    with mgr.exec() as prompt:
        params = {
            "model": prompt.params.model,
            "messages": [
                {
                    "role": "system",
                    "content": prompt.render_template.system(
                        language="Spanish",
                    ),
                },
            ],
        }
        response = openai.chat.completions.create(**params)
    ```
  </CodeGroupItem>
</CodeGroup>

## Core Concepts

### Prompt Managers

The interface for working with prompts:

* Initialize with version specifications
* Support for latest version tracking
* Background refresh capabilities
* Type-safe prompt execution

### Prompt Execution

Safe and type-checked prompt usage:

* Context managers for execution
* Template rendering with validation
* Parameter access with type safety
* Error handling and logging

### Version Management

Flexible version control:

* Major version pinning
* Latest minor version support
* Undeployed revision access
* Background refresh options

## Next Steps

* [TypeScript Quick Start](/v2/guides/prompt-management/typescript/quick-start)
* [Python Quick Start](/v2/guides/prompt-management/python/quick-start)


# Quick Start
Source: https://docs.autoblocks.ai/v2/guides/prompt-management/python/quick-start

Get started with the Autoblocks Python Prompt SDK to manage and execute your prompts with type safety and autocomplete.

# Python Prompt SDK Quick Start

## Installation

<CodeGroup>
  ```bash poetry
  poetry add autoblocksai
  ```

  ```bash pip
  pip install autoblocksai
  ```
</CodeGroup>

<Note>
  The prompt SDK requires [`pydantic`](https://docs.pydantic.dev/latest/) v2 to be installed.

  You can install it with `poetry add pydantic` or `pip install pydantic`.
</Note>

## Create a Prompt App

Before creating prompts, you need to create a prompt app. Apps are the top-level organizational unit in Autoblocks and help you manage access and track usage of your prompts.

1. Go to the [apps page](https://app-v2.autoblocks.ai/apps)
2. Click "Create App"
3. Select "Prompt App" as the app type
4. Give your app a name and description
5. Configure access settings for your team members

<Note>
  All prompts you create will be associated with this app. Make sure to choose a name that reflects the purpose of your prompts (e.g., "Content Generation" or "Customer Support").
</Note>

## Autogenerate Prompt Classes

The prompt SDK ships with a CLI that generates Python classes with methods and arguments
that mirror the structure of your prompt's templates and parameters.
This gives you type safety and autocomplete when working with Autoblocks prompts in your codebase.

### Set Your API Key

Get your Autoblocks API key from the [settings](https://app-v2.autoblocks.ai/settings/api-keys)
page and set it as an environment variable:

```bash
export AUTOBLOCKS_API_KEY=...
```

### Run the CLI

Installing the `autoblocksai` package adds the `prompts generate_v2` CLI to your path:

```bash
poetry run prompts generate-v2 --output-dir my_project/autoblocks_prompts
```

Running the CLI will create a file at the `outfile-dir` location you have configured.
You will need to run the `prompts generate-v2` CLI any time you deploy a new major version of a prompt.

<Note>
  When a new **major version** of a prompt is available and you want to update your codebase to use it,
  the process will be:

  * run `prompts generate-v2`
  * update any broken code
</Note>

<Note>
  If you're not using `poetry`, make sure to activate the virtualenv where the `autoblocksai` package is installed
  so that the `prompts generate-v2` CLI can be found.
</Note>

## Import and Use a Prompt Manager

For each application in your organization, there will be a factory named after the application slug.
For example, if the application slug is `"my_app"`, then you can import like `from autoblocks_prompts import my_app`.
Using that factory, you can initialize a prompt manager for each prompt in the application. If you have a prompt with
the ID `"text-summarization"`, then you can initialize a prompt manager like `my_app.text_summarizatio_prompt_manager`.

## Initialize the Prompt Manager

Create a single instance of the prompt manager for the lifetime of your application.
The only required argument when initializing a prompt manager is the minor version.
To specify the minor version, use the enum that was autogenerated by the CLI:

<CodeGroup>
  ```python specific minor version
  from my_project.autoblocks_prompts import my_app

  mgr = my_app.text_summarization_prompt_manager(
    major_version="1",
    minor_version="0",
  )
  ```

  ```python latest minor version
  from my_project.autoblocks_prompts import my_app

  mgr = my_app.text_summarization_prompt_manager(
    major_version="1",
    minor_version="latest",
  )
  ```

  ```python specific revision
  from my_project.autoblocks_prompts import my_app

  mgr = my_app.text_summarization_prompt_manager(
    major_version="undeployed",
    minor_version="clvods6wq0003m44zc8sizv2l",
  )
  ```

  ```python latest undeployed revision
  from my_project.autoblocks_prompts import my_app

  mgr = my_app.text_summarization_prompt_manager(
    major_version="undeployed",
    minor_version="latest",
  )
  ```
</CodeGroup>

<Note>
  When the version is set to `"latest"`, the prompt manager periodically refresh the in-memory prompt
  in the background according to the `refresh_interval`.
  See the [`AutoblocksPromptManager`](/v2/guides/prompt-management/python/sdk-reference#autoblocks-prompt-manager) reference for more information.
</Note>

## Execute a Prompt

The `exec` method on the prompt manager starts a new prompt execution context.
It is a context manager that creates a [`PromptExecutionContext`](/v2/guides/prompt-management/python/sdk-reference#prompt-execution-context)
instance that gives you fully-typed access to the prompt's templates and parameters:

```python
with mgr.exec() as prompt:
    params = dict(
        model=prompt.params.model,
        temperature=prompt.params.temperature,
        max_tokens=prompt.params.max_tokens,
        messages=[
            dict(
                role="system",
                content=prompt.render_template.system(
                    language_requirement=prompt.render_template.util_language(
                        language="Spanish",
                    ),
                    tone_requirement=prompt.render_template.util_tone(
                        tone="formal",
                    ),
                ),
            ),
            dict(
                role="user",
                content=prompt.render_template.user(
                  document="mock document",
                ),
            ),
        ],
    )

    response = openai.chat.completions.create(**params)
```

## Organizing Multiple Prompt Managers

If you are using many prompt managers, we recommend initializing them in a single file and importing them as a module:

`prompt_managers.py`:

```python
from my_project.autoblocks_prompts import my_app

text_summarization = my_app.text_summarization_prompt_manager(
  major_version="1",
  minor_version="0",
)

flashcard_generator = my_app.flashcard_generator_prompt_manager(
  major_version="1",
  minor_version="0",
)

study_guide_outline = my_app.study_guide_outline_prompt_manager(
  major_version="1",
  minor_version="0",
)
```

Then, throughout your application, import the entire `prompt_managers`
module and use the prompt managers as needed:

```python
from my_project import prompt_managers

with prompt_managers.text_summarization.exec() as prompt:
  ...

with prompt_managers.flashcard_generator.exec() as prompt:
  ...

with prompt_managers.study_guide_outline.exec() as prompt:
  ...
```

This is preferable over importing each prompt manager individually, as it keeps
the context of it being a prompt manager in the name. If you were to import each
manager individually, it is hard to tell at a glance that it is a prompt manager:

```python
from my_project.prompt_managers import text_summarization

# Somewhere deep in a file, it is not clear
# what the `text_summarization` variable is
with text_summarization.exec() as prompt:
  ...
```


# SDK Reference
Source: https://docs.autoblocks.ai/v2/guides/prompt-management/python/sdk-reference

Reference documentation for the Autoblocks Python Prompt SDK, including the AutoblocksPromptManager and PromptExecutionContext classes.

# Python Prompt SDK Reference

## `AutoblocksPromptManager`

This is the base class the autogenerated prompt manager classes inherit from.
Below are the arguments that can be passed when initializing a prompt manager:

| name               | required | default                                   | description                                                                                                                                                                                                                              |
| ------------------ | -------- | ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `major_version`    | true     |                                           | The major version of the prompt to use. Can be a specific version number or "undeployed" for local development.                                                                                                                          |
| `minor_version`    | true     |                                           | Can be one of: a specific minor version or "latest"                                                                                                                                                                                      |
| `api_key`          | false    | `AUTOBLOCKS_API_KEY` environment variable | Your Autoblocks API key.                                                                                                                                                                                                                 |
| `refresh_interval` | false    | `timedelta(seconds=10)`                   | How often to refresh the latest prompt. Only relevant if the minor version is set to `"latest"` or `"latest"` is used in the weighted list.                                                                                              |
| `refresh_timeout`  | false    | `timedelta(seconds=30)`                   | How long to wait for the latest prompt to refresh before timing out. A refresh timeout will not raise an uncaught exception. An error will be logged and the background refresh process will continue to run at its configured interval. |
| `init_timeout`     | false    | `timedelta(seconds=30)`                   | How long to wait for the prompt manager to be ready before timing out.                                                                                                                                                                   |

<CodeGroup>
  ```python specific minor version
  from my_project.autoblocks_prompts import my_app

  mgr = my_app.text_summarization_prompt_manager(
    major_version="1",
    minor_version="0",
  )
  ```

  ```python latest minor version
  from my_project.autoblocks_prompts import my_app

  mgr = my_app.text_summarization_prompt_manager(
    major_version="1",
    minor_version="latest",
  )
  ```
</CodeGroup>

### `exec`

A [context manager](https://docs.python.org/3/reference/datamodel.html#context-managers) that starts a prompt execution context by creating a new
[`PromptExecutionContext`](#prompt-execution-context) instance.

```python
with mgr.exec() as prompt:
    ...
```

## `PromptExecutionContext`

An instance of this class is created every time a new execution context is started with the `exec` context manager.
It contains a frozen copy of the prompt manager's in-memory prompt at the time `exec` was called.
This ensures the prompt is stable for the duration of an execution, even if the in-memory prompt on the manager
instance is refreshed mid-execution.

### `params`

A `pydantic` model instance with the prompt's parameters.

```python
with mgr.exec() as prompt:
    params = dict(
        model=prompt.params.model,
        temperature=prompt.params.temperature,
        ...
    )
```

### `render_template`

The `render_template` attribute contains an instance of a class that has methods for rendering each of the prompt's templates.
The template IDs and template parameters are all converted to snake case so that the method and argument names follow
Python naming conventions.

For example, the prompt in the [quick start](./quick-start) guide contains the below templates:

#### `system`

```
Objective: You are provided with a document...

{{ languageRequirement }}
{{ toneRequirement }}
```

#### `user`

```
Document:

'''
{{ document }}
'''

Summary:
```

#### `util/language`

```
Always respond in {{ language }}.
```

#### `util/tone`

```
Always respond in a {{ tone }} tone.
```

From this, the CLI autogenerates a class with the following methods:

```python
def system(
    self,
    *,
    language_requirement: str,
    tone_requirement: str,
) -> str:
    ...

def user(self, *, document: str) -> str:
    ...

def util_language(self, *, language: str) -> str:
    ...

def util_tone(self, *, tone: str) -> str:
    ...
```

As a result, you are able to render your templates with functions that are aware of the required parameters for each template:

```python
with mgr.exec() as prompt:
    params = dict(
        model=prompt.params.model,
        temperature=prompt.params.temperature,
        max_tokens=prompt.params.max_tokens,
        messages=[
            dict(
                role="system",
                content=prompt.render_template.system(
                    language_requirement=prompt.render_template.util_language(
                        language="Spanish",
                    ),
                    tone_requirement=prompt.render_template.util_tone(
                        tone="formal",
                    ),
                ),
            ),
            dict(
                role="user",
                content=prompt.render_template.user(
                  document="mock document",
                ),
            ),
        ],
    )
```

### `render_tool`

The `render_tool` attribute contains an instance of a class that has methods for rendering each of the prompt's tools.
The tool names and tool parameters are all converted to snake case so that the method and argument names follow
Python naming conventions. The tool will be in the JSON schema format that OpenAI expects.

```python
with mgr.exec() as prompt:
    params = dict(
        model=prompt.params.model,
        temperature=prompt.params.temperature,
        max_tokens=prompt.params.max_tokens,
        tools=[
            prompt.render_tool.my_tool(
                description="My description"
            ),
        ]
       # rest of params...
    )
```


# Quick Start
Source: https://docs.autoblocks.ai/v2/guides/prompt-management/typescript/quick-start

Get started with the Autoblocks TypeScript Prompt SDK to manage and execute your prompts with type safety and autocomplete.

# TypeScript Prompt SDK Quick Start

## Install

<CodeGroup>
  ```bash npm
  npm install @autoblocks/client
  ```

  ```bash yarn
  yarn add @autoblocks/client
  ```

  ```bash pnpm
  pnpm add @autoblocks/client
  ```
</CodeGroup>

## Create a Prompt App

Before creating prompts, you need to create a prompt app. Apps are the top-level organizational unit in Autoblocks and help you manage access and track usage of your prompts.

1. Go to the [apps page](https://app-v2.autoblocks.ai/apps)
2. Click "Create App"
3. Select "Prompt App" as the app type
4. Give your app a name and description
5. Configure access settings for your team members

<Note>
  All prompts you create will be associated with this app. Make sure to choose a name that reflects the purpose of your prompts (e.g., "Content Generation" or "Customer Support").
</Note>

## Generate types

In order to generate types, you need to set your Autoblocks API key from the [settings](https://app-v2.autoblocks.ai/settings/api-keys) page
as an environment variable:

```bash
export AUTOBLOCKS_API_KEY=...
```

Then, add the `prompts generate-v2` command to your `package.json` scripts:

```json
"scripts": {
  "gen": "prompts generate-v2"
}
```

You will need to run this script any time you deploy a new major version of your prompt.

<Note>
  Make sure to generate the types in your CI/CD pipeline before running type checking on your application.

  ```json
  "scripts": {
    "gen": "prompts generate-v2",
    "type-check": "npm run gen && tsc --noEmit"
  }
  ```
</Note>

## Initialize the prompt manager

Create a single instance of the prompt manager for the lifetime of your application.
When initializing the prompt manager, the major version must be pinned while the minor version can either be
pinned or set to `'latest'`:

<CodeGroup>
  ```ts pinned
  import { AutoblocksPromptManager } from '@autoblocks/client/prompts';

  const mgr = new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'text-summarization',
    version: {
      major: '1',
      minor: '0',
    },
  });
  ```

  ```ts latest
  import { AutoblocksPromptManager } from '@autoblocks/client/prompts';

  const mgr = new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'text-summarization',
    version: {
      major: '1',
      minor: 'latest',
    },
  });
  ```
</CodeGroup>

<Note>
  When the version is set to `'latest'`, the prompt manager periodically refreshes the in-memory prompt
  in the background according to the `refreshInterval`.
  See the [`AutoblocksPromptManager`](/v2/guides/prompt-management/typescript/sdk-reference#autoblocks-prompt-manager) reference for more information.
</Note>

## Wait for the manager to be ready

At the entrypoint to your application, wait for the prompt manager to be ready before handling requests.

```ts
await mgr.init();
```

## Execute a prompt

The `exec` method on the prompt manager starts a new prompt execution context.
It creates a [`PromptExecutionContext`](/v2/guides/prompt-management/typescript/sdk-reference#prompt-execution-context)
instance that gives you fully-typed access to the prompt's templates and parameters:

```ts
const response = await mgr.exec(async ({ prompt }) => {
  const params: ChatCompletionCreateParamsNonStreaming = {
    model: prompt.params.model,
    temperature: prompt.params.temperature,
    messages: [
      {
        role: 'system',
        content: prompt.renderTemplate({
          template: 'system',
          params: {
            languageRequirement: prompt.renderTemplate({
              template: 'util/language',
              params: {
                language: 'Spanish',
              },
            }),
            toneRequirement: prompt.renderTemplate({
              template: 'util/tone',
              params: {
                tone: 'silly',
              },
            }),
          },
        }),
      },
      {
        role: 'user',
        content: prompt.renderTemplate({
          template: 'user',
          params: {
            document: 'mock document',
          },
        }),
      },
    ],
  };

  const response = await openai.chat.completions.create(params);

  return response;
});
```

## Develop locally against a prompt revision that hasn't been deployed

As you create new revisions in the UI, your private revisions (or revisions that have been shared by your teammates)
can be pulled down using `dangerously-use-undeployed`:

<CodeGroup>
  ```ts latest
  import { AutoblocksPromptManager } from '@autoblocks/client/prompts';

  const mgr = new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'text-summarization',
    version: {
      major: 'dangerously-use-undeployed',
      minor: 'latest',
    },
  });
  ```

  ```ts pinned
  import { AutoblocksPromptManager } from '@autoblocks/client/prompts';

  const mgr = new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'text-summarization',
    version: {
      major: 'dangerously-use-undeployed',
      minor: 'clvods6wq0003m44zc8sizv2l',
    },
  });
  ```
</CodeGroup>

<Note>
  As the name suggests, this should only be used in local development and never in production.
</Note>

## Organizing multiple prompt managers

If you are using many prompt managers, we recommend initializing them in a single file and importing them as a module:

`prompts.ts`:

```typescript
import { AutoblocksPromptManager } from '@autoblocks/client/prompts';

const refreshInterval = { seconds: 5 };

const managers = {
  textSummarization: new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'text-summarization',
    version: {
      major: '1',
      minor: 'latest',
    },
    refreshInterval,
  }),
  flashcardGenerator: new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'flashcard-generator',
    version: {
      major: '1',
      minor: 'latest',
    },
    refreshInterval,
  }),
  studyGuideOutline: new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'study-guide-outline',
    version: {
      major: '1',
      minor: 'latest',
    },
    refreshInterval,
  }),
};

async function init() {
  await Promise.all(Object.values(managers).map(mgr => mgr.init()));
}

export default {
  init,
  ...managers,
};
```

Make sure to call `init` at the entrypoint of your application:

```typescript
import prompts from '~/prompts';

async function start() {
  await prompts.init();

  ...
}
```

Then, throughout your application, import the entire `prompts`
module and use the prompt managers as needed:

```typescript
import prompts from '~/prompts';

prompts.textSummarization.exec(({ prompt }) => {
  ...
});

prompts.flashcardGenerator.exec(({ prompt }) => {
  ...
});

prompts.studyGuideOutline.exec(({ prompt }) => {
  ...
});
```


# SDK Reference
Source: https://docs.autoblocks.ai/v2/guides/prompt-management/typescript/sdk-reference

Reference documentation for the Autoblocks TypeScript Prompt SDK, including the AutoblocksPromptManager and PromptExecutionContext classes.

# TypeScript Prompt SDK Reference

## `AutoblocksPromptManager`

Below are the arguments that can be passed when initializing the prompt manager:

| name              | required | default                                   | description                                                                                                                                                                                                                          |
| ----------------- | -------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `appName`         | true     |                                           | The name of the app that contains the prompt.                                                                                                                                                                                        |
| `id`              | true     |                                           | The ID of the prompt.                                                                                                                                                                                                                |
| `version.major`   | true     |                                           | Must be pinned to a specific major version.                                                                                                                                                                                          |
| `version.minor`   | true     |                                           | Can be one of: a specific minor version or the string `"latest"`.                                                                                                                                                                    |
| `apiKey`          | false    | `AUTOBLOCKS_API_KEY` environment variable | Your Autoblocks API key.                                                                                                                                                                                                             |
| `refreshInterval` | false    | `{ seconds: 10 }`                         | How often to refresh the latest prompt. Only relevant if the minor version is set to `"latest"`.                                                                                                                                     |
| `refreshTimeout`  | false    | `{ seconds: 30 }`                         | How long to wait for the latest prompt to refresh before timing out. A refresh timeout will not throw an uncaught error. An error will be logged and the background refresh process will continue to run at its configured interval. |
| `initTimeout`     | false    | `{ seconds: 30 }`                         | How long to wait for the prompt manager to be ready (when calling `init()`) before timing out.                                                                                                                                       |

<CodeGroup>
  ```ts pinned
  import { AutoblocksPromptManager } from '@autoblocks/client/prompts';

  const mgr = new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'text-summarization',
    version: {
      major: '1',
      minor: '0',
    },
  });
  ```

  ```ts latest
  import { AutoblocksPromptManager } from '@autoblocks/client/prompts';

  const mgr = new AutoblocksPromptManager({
    appName: 'my-app',
    id: 'text-summarization',
    version: {
      major: '1',
      minor: 'latest',
    },
    refreshInterval: {
      seconds: 5,
    },
  });
  ```
</CodeGroup>

### `exec`

Starts a prompt execution context by creating a new [`PromptExecutionContext`](#prompt-execution-context) instance.

```ts
const response = await mgr.exec(async ({ prompt }) => {
  ...
});
```

## `PromptExecutionContext`

An instance of this class is created every time a new execution context is started with the `exec` function.
It contains a frozen copy of the prompt manager's in-memory prompt at the time `exec` was called.
This ensures the prompt is stable for the duration of an execution, even if the in-memory prompt on the manager
instance is refreshed mid-execution.

### `params`

An object with the prompt's parameters.

```ts
const response = await mgr.exec(async ({ prompt }) => {
  const params: ChatCompletionCreateParamsNonStreaming = {
    model: prompt.params.model,
    temperature: prompt.params.temperature,
    ...
  };
});
```

### `renderTemplate`

The `renderTemplate` function accepts a template ID and parameters and returns the rendered template as a string.

| name     | required | description                                                                                                                      |
| -------- | -------- | -------------------------------------------------------------------------------------------------------------------------------- |
| template | true     | The ID of the template to render.                                                                                                |
| params   | true     | The parameters to pass to the template. These values are used to replace the template parameters wrapped in double curly braces. |

```ts
const response = await mgr.exec(async ({ prompt }) => {
  // Use `prompt.renderTemplate` to render a template
  console.log(prompt.renderTemplate(
    {
      template: 'util/language',
      params: {
        // Replaces "{{ language }}" with "Spanish"
        language: 'Spanish',
      },
    },
  ));
  // Logs "Always respond in Spanish."
});
```

### `renderTool`

The `renderTool` function accepts a tool name and parameters and returns the rendered tool as an object in the JSON schema format that OpenAI expects.

| name   | required | description                                                                                                              |
| ------ | -------- | ------------------------------------------------------------------------------------------------------------------------ |
| tool   | true     | The name of the tool to render.                                                                                          |
| params | true     | The parameters to pass to the tool. These values are used to replace the tool parameters wrapped in double curly braces. |

```ts
const response = await mgr.exec(async ({ prompt }) => {
  // Use `prompt.renderTool` to render a tool
  console.log(prompt.renderTool(
    {
      tool: 'MyTool',
      params: {
        // Replaces "{{ language }}" with "Spanish"
        language: 'Spanish',
      },
    },
  ));
});
```


# Overview
Source: https://docs.autoblocks.ai/v2/guides/rbac/overview

Learn how to manage roles, teams, and access control in Autoblocks, including SSO integration and granular app permissions.

Autoblocks provides robust Role-Based Access Control (RBAC) and team management features, enabling you to efficiently manage user access and permissions across your organization.

## Single Sign-On (SSO) Integration

Autoblocks supports various SSO options to streamline user authentication and access management:

* **Social SSO**: Integrate with popular social identity providers for easy access.
* **Enterprise SSO**: Connect with enterprise identity providers for secure, centralized authentication.
* **Directory Sync**: Synchronize user directories to maintain up-to-date access controls.

## Custom Role Creation and Mapping

Autoblocks allows you to create and map custom roles tailored to your organization's needs:

* **Permission-Based Roles**: Define roles based on specific permissions, enabling precise control over user access.
* **Flexible Configuration**: Customize roles to align with your organizational structure and requirements.
* **Granular Permissions**: Assign fine-grained permissions to ensure users have exactly the access they need.

## Granular Access Control

Autoblocks allows granular access control at the app level:

* **App-Specific Permissions**: Assign roles and permissions specific to individual apps.
* **PHI Control**: In healthcare settings, control which apps may contain Protected Health Information (PHI).

## Team Management

Efficiently manage teams and their access:

* **Team Creation**: Create teams to group users with similar roles and permissions.
* **Access Assignment**: Assign roles and permissions to teams for streamlined management.
* **Audit Logs**: Track changes and access patterns for compliance and security.

## Best Practices

* **Regular Reviews**: Periodically review and update roles and permissions to ensure alignment with organizational needs.
* **Least Privilege**: Apply the principle of least privilege to minimize access risks.
* **Audit and Compliance**: Utilize audit logs to maintain compliance with industry standards and regulations.

## Next Steps

* [Security and Compliance](/v2/deployment/security-and-compliance)


# Running in CI
Source: https://docs.autoblocks.ai/v2/guides/testing/ci

Learn how to run your Autoblocks tests in a CI environment.

## Prerequisites

Make sure you've followed our [quick start guide](./quick-start) and have your tests running locally.

Once you have your tests set up and running locally,
you can run them in a CI environment via the same `npx autoblocks testing exec` command.

## Setup

### Create secrets

You will need to add your Autoblocks API key as a secret in your GitHub repository.
Make sure to include any other secrets your tests depend on, such as your OpenAI API key.

You can get your Autoblocks API key from the [settings](https://app-v2.autoblocks.ai/settings/api-keys)
page and follow the instructions [here](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions#creating-secrets-for-a-repository)
for adding a repository secret.

### Ensure your commit email address is configured

We associate test runs on feature branches with your Autoblocks user based on the commit author's email address.
Make sure your commit email address matches your Autoblocks user email address.

You can update your commit address globally:

```bash
git config --global user.email "myname@mycompany.com"
```

Or within a specific repository:

```bash
git config user.email "myname@mycompany.com"
```

<Hint>
  Check your existing email address:

  ```bash
  git config --global user.email
  ```

  Or for a particular repository:

  ```bash
  git config user.email
  ```
</Hint>

## Examples

### Run on every push

Run your tests on every [push](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#push) to your repository:

```yaml
name: Autoblocks Tests

on: push

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      # Set up your environment like you would for your
      # unit or integration tests
      ...

      # Run the Autoblocks tests
      - name: Run Autoblocks tests
        run: npx autoblocks testing exec -- <cmd>
        env:
          # This is created automatically by GitHub
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # These were added by you to your repository's secrets
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          AUTOBLOCKS_API_KEY: ${{ secrets.AUTOBLOCKS_API_KEY }}
```

<Note>
  **Why do you need to set the `GITHUB_TOKEN` as an environment variable?**

  The CLI uses the [automated `GITHUB_TOKEN` secret](https://docs.github.com/en/actions/security-guides/automatic-token-authentication)
  to fetch metadata about the run that is not directly available within the run's [context](https://docs.github.com/en/actions/learn-github-actions/contexts).

  The `GITHUB_TOKEN` is created for you automatically by GitHub and is available to use in your workflow file.
  You do not need to add it to your repository's secrets.
  Its scope is determined by your workflow's [permissions](https://docs.github.com/en/actions/using-jobs/assigning-permissions-to-jobs).
</Note>

### Run only when a pull request is opened or updated

Run your tests when a [pull request](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request) is opened or updated:

```yaml
name: Autoblocks Tests

on: pull_request

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest
    
    ...
```

### Run on a schedule

Run your tests on a [schedule](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule):

```yaml
name: Autoblocks Tests

on:
  schedule: # Run every day at ~7:17am PST.
    - cron: '17 15 * * *'

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest
    
    ...
```

<Hint>
  Schedule your workflow to run at a minute that isn't 0;
  this is recommended by the GitHub team to avoid [overload](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule)
  and reduce the risk your job is dropped.
</Hint>

### Run manually

Run your tests manually via a [workflow\_dispatch](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_dispatch) event:

```yaml
name: Autoblocks Tests

on: workflow_dispatch

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest
    
    ...
```

See the GitHub documentation on [manually running a workflow](https://docs.github.com/en/actions/using-workflows/manually-running-a-workflow).

### Run only on certain commits

This workflow will run your tests only when a commit contains a certain string.
For example, you can run your tests only when a commit message contains `[autoblocks]`:

```yaml
name: Autoblocks Tests

# This workflow will run on every push, but the job will not run unless
# the string "[autoblocks]" is somewhere in the commit message
on: push

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest

    if: contains(github.event.head_commit.message, '[autoblocks]')
    
    ...
```

Or, if you are triggering on pull requests:

```yaml
name: Autoblocks Tests

# This workflow will run on every pull request update, but the job will not run unless
# the string "[autoblocks]" is somewhere in the pull request title
on: pull_request

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest

    if: contains(github.event.pull_request.title, '[autoblocks]')
    
    ...
```

### Run in multiple scenarios

You can use a combination of all of the above to run your tests in multiple scenarios:

```yaml
name: Autoblocks Tests

on:
  # Run on every push when the commit message contains "[autoblocks]".
  # See below for the if condition that enforces this.
  push:

  # Run manually.
  workflow_dispatch:

  # Run every day at ~7:17am PST.
  schedule: 
    - cron: '17 15 * * *'

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest

    # For push events, check the commit message contains "[autoblocks]".
    if: github.event_name != 'push' || contains(github.event.head_commit.message, '[autoblocks]')
    
    ...
```

### More examples

Check out some end to end examples in our [examples repository](https://github.com/autoblocksai/autoblocks-examples):

* Python:
  * [Project](https://github.com/autoblocksai/autoblocks-examples/tree/main/Python/testing-sdk)
  * [Workflow](https://github.com/autoblocksai/autoblocks-examples/blob/d37c47cad34787183637faf58e63018404ffcc95/.github/workflows/autoblocks-testing.yml#L9-L40)
* TypeScript:
  * [Project](https://github.com/autoblocksai/autoblocks-examples/tree/main/JavaScript/testing-sdk)
  * [Workflow](https://github.com/autoblocksai/autoblocks-examples/blob/d37c47cad34787183637faf58e63018404ffcc95/.github/workflows/autoblocks-testing.yml#L42-L67)

## Slack Notifications

You can receive Slack notifications with test results by setting the `AUTOBLOCKS_SLACK_WEBHOOK_URL` environment variable. To get set up:

* Create a webhook URL for the channel where you want results to be posted by following Slack's instructions on [setting up incoming webhooks](https://api.slack.com/messaging/webhooks)
* Add this webhook URL to your repository as a secret called `AUTOBLOCKS_SLACK_WEBHOOK_URL`
  in addition to the secrets you already added in the [setup](#setup-create-secrets) section

Then, set the environment variable in your workflow:

```yaml
name: Autoblocks Tests

on: push

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest

    steps:
      ...

      - name: Run Autoblocks tests
        run: npx autoblocks testing exec -- <cmd>
        env:
          # This is created automatically by GitHub
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # These were added by you to your repository's secrets
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          AUTOBLOCKS_API_KEY: ${{ secrets.AUTOBLOCKS_API_KEY }}
          # Posts a message to a Slack channel with the test results
          AUTOBLOCKS_SLACK_WEBHOOK_URL: ${{ secrets.AUTOBLOCKS_SLACK_WEBHOOK_URL }}
```

If you have multiple triggers for your workflow and only want to receive Slack notifications for some of them,
you can add a step that only [sets the environment variable](https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-an-environment-variable) for certain event types:

```yaml
name: Autoblocks Tests

on:
  # This workflow has multiple triggers
  push:
  workflow_dispatch:
  schedule: 
    - cron: '17 15 * * *'

jobs:
  autoblocks-tests:
    runs-on: ubuntu-latest

    steps:
      ...

      # But we only want to send Slack notifications for scheduled runs
      - name: Set Autoblocks Slack webhook URL
        if: github.event_name == 'schedule'
        run: echo "AUTOBLOCKS_SLACK_WEBHOOK_URL=${{ secrets.AUTOBLOCKS_SLACK_WEBHOOK_URL }}" >> $GITHUB_ENV
      
      - name: Run Autoblocks tests
        run: npx autoblocks testing exec -- <cmd>
        env:
          # This is created automatically by GitHub
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # These were added by you to your repository's secrets
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          AUTOBLOCKS_API_KEY: ${{ secrets.AUTOBLOCKS_API_KEY }}
          # The AUTOBLOCKS_SLACK_WEBHOOK_URL has already been set
          # above, so it doesn't need to be set here
```

## GitHub Comments

A summary of test results is posted automatically as a comment on the pull request associated with the commit that triggered the test run.
If a commit is pushed but a pull request hasn't been opened yet, the summary is posted on the commit instead.

<Note>
  Autoblocks updates the same comment with the latest test results each time the workflow runs.
  This prevents the comment section from becoming cluttered.
</Note>

<Note>
  If you would like to disable comments, you can set the `AUTOBLOCKS_DISABLE_GITHUB_COMMENTS` environment variable to `1`.
</Note>

### GitHub Actions Permissions (Advanced)

If your GitHub organization has set the [default permissions](https://docs.github.com/en/organizations/managing-organization-settings/disabling-or-limiting-github-actions-for-your-organization#setting-the-permissions-of-the-github_token-for-your-organization)
for `GITHUB_TOKEN` to be **restricted**, you will need to explicitly give the workflow permission to post comments on pull requests or commits.

* Commenting on a pull request from a `pull_request` event:

```yaml
name: Autoblocks Tests

on: pull_request

permissions:
  # Allow commenting on pull requests
  issues: write
```

* Commenting on a pull request from a `push` event:

```yaml
name: Autoblocks Tests

on: push

permissions:
  # Allow commenting on pull requests
  issues: write
  # Allow getting associated pull requests when event is type `push`
  pull-requests: read
```

## Other CI Providers

Using a CI provider other than GitHub Actions? Contact us at [support@autoblocks.ai](mailto:support@autoblocks.ai) and we'll help get you set up.


# Overview
Source: https://docs.autoblocks.ai/v2/guides/testing/overview

Learn about Autoblocks Testing and how it helps you test your LLM applications.

## Key Features

Autoblocks Testing provides a powerful framework for testing your LLM applications. It enables you to declaratively define tests and execute them either locally or in a CI/CD pipeline.

### Declarative Test Definition

Define your tests using a simple, declarative API that works with both TypeScript and Python. Your tests can exist as standalone scripts or be integrated into your existing test framework.

### Flexible Test Cases

Create test cases that match your application's needs. Test cases can contain any properties necessary to run your tests and make assertions on the output.

### Powerful Evaluators

Build custom evaluators to assess your test outputs. Evaluators can:

* Score outputs on a scale from 0 to 1
* Define pass/fail thresholds
* Include metadata for better debugging
* Support both synchronous and asynchronous evaluation
* Handle concurrent execution with configurable limits

### Local and CI/CD Support

Run your tests:

* Locally during development
* In your CI/CD pipeline for automated testing
* With progress tracking and real-time results

### Rich Results Visualization

View detailed test results in the Autoblocks platform, including:

* Test suite progress
* Individual test case results
* Evaluation scores and metadata
* Failure analysis

## Getting Started

Choose your preferred language to begin:

<CodeGroup>
  <CodeGroupItem title="TypeScript">
    ```typescript
    import { runTestSuite } from '@autoblocks/client/testing';

    runTestSuite({
      id: 'my-test-suite',
      testCases: genTestCases(),
      evaluators: [new HasAllSubstrings(), new IsFriendly()],
      fn: testFn,
    });
    ```
  </CodeGroupItem>

  <CodeGroupItem title="Python">
    ```python
    from autoblocks.testing.run import run_test_suite

    run_test_suite(
      id="my-test-suite",
      test_cases=gen_test_cases(),
      evaluators=[HasAllSubstrings(), IsFriendly()],
      fn=test_fn,
    )
    ```
  </CodeGroupItem>
</CodeGroup>

## Core Concepts

### Test Cases

Test cases define the inputs and expected outputs for your tests. They can be simple or complex, depending on your needs.

### Evaluators

Evaluators assess the output of your tests and determine if they pass or fail. They can:

* Score outputs numerically
* Define pass/fail thresholds
* Include metadata for debugging
* Support both sync and async evaluation

### Test Suites

Test suites bring together your test cases and evaluators to run comprehensive tests on your application.

## Next Steps

* [TypeScript Quick Start](/v2/guides/testing/typescript/quick-start)
* [Python Quick Start](/v2/guides/testing/python/quick-start)
* [CI/CD Integration](/v2/guides/testing/ci)


# Python Quick Start
Source: https://docs.autoblocks.ai/v2/guides/testing/python/quick-start

Get started with the Autoblocks Testing SDK for Python.

## Overview

Autoblocks Testing enables you to declaratively define tests for your LLM application and execute them either locally
or in a CI/CD pipeline. Your tests can exist in a standalone script or be executed as part of a larger test framework.

```python
run_test_suite(
  id="my-test-suite",
  test_cases=gen_test_cases(),
  evaluators=[HasAllSubstrings(), IsFriendly()],
  fn=test_fn,
)
```

## Getting Started

### Install the SDK

```bash
poetry add autoblocksai
# or
pip install autoblocksai
```

### Define your test case schema

Your test case schema should contain all of the properties
necessary to run your test function and to then make assertions
on the output via your evaluators.
This schema can be anything you want in order to facilitate testing your application.

```python
import dataclasses

from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.util import md5

@dataclasses.dataclass
class MyTestCase(BaseTestCase):
    """
    A test case can be any class that subclasses BaseTestCase.

    This example is a dataclass, but it could also be a pydantic model,
    plain Python class, etc.
    """
    input: str
    expected_substrings: list[str]

    def hash(self) -> str:
        """
        This hash serves as a unique identifier for a test case throughout its lifetime.

        Required to be implemented by subclasses of BaseTestCase.
        """
        return md5(self.input)
```

### Implement a function to test

This function should take an instance of a test case and return an output.
The function can be synchronous or asynchronous and the output can be anything: a string, a number, a complex object, etc.

For this example, we are splitting the test case's `input` property on its hyphens and randomly discarding some of the substrings
to simulate failures on the `has-all-substrings` evaluator.

```python
import random
import asyncio

async def test_fn(test_case: MyTestCase) -> str:
    """ This could also be a synchronous function. """
    # Simulate doing work
    await asyncio.sleep(random.random())

    substrings = test_case.input.split("-")
    if random.random() < 0.2:
        # Remove a substring randomly. This will cause about 20% of the test cases to fail
        # the "has-all-substrings" evaluator.
        substrings.pop()

    return "-".join(substrings)
```

### Create an evaluator

Evaluators allow you to attach an [`Evaluation`](#reference-evaluation) to a test case's output,
where the output is the result of running the test case through the function you are testing.
Your test suite can have multiple evaluators.
The evaluation method that you implement on the evaluator will have access to both the test case
instance and the output of the test function over the given test case. Your evaluation method
can be synchronous or asynchronous, but it must return an instance of `Evaluation`.

The evaluation must have a score between 0 and 1, and you can optionally attach a
[`Threshold`](#reference-threshold) that describes the range the score must be in in order
to be considered passing. If no threshold is attached, the score is reported and the pass / fail
status is undefined. Evaluations can also have metadata attached to them, which can be useful
for providing additional context when an evaluation fails.

For this example we'll define two evaluators:

```python
import random
import asyncio

from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold

class HasAllSubstrings(BaseTestEvaluator):
    """
    An evaluator is a class that subclasses BaseTestEvaluator.

    It must specify an ID, which is a unique identifier for the evaluator.
    """
    id = "has-all-substrings"

    def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
        """
        Evaluates the output of a test case.

        Required to be implemented by subclasses of BaseTestEvaluator.
        This method can be synchronous or asynchronous.
        """
        missing_substrings = [s for s in test_case.expected_substrings if s not in output]
        score = 0 if missing_substrings else 1
        return Evaluation(
            score=score,
            # If the score is not greater than or equal to 1,
            # this evaluation will be marked as a failure.
            threshold=Threshold(gte=1),
            metadata=dict(
                # Include the missing substrings as metadata
                # so that we can easily see which strings were
                # missing when viewing a failed evaluation
                # in the Autoblocks UI.
                missing_substrings=missing_substrings,
            ),
        )

class IsFriendly(BaseTestEvaluator):
    id = "is-friendly"

    # The maximum number of concurrent calls to `evaluate_test_case` allowed for this evaluator.
    # Useful to avoid rate limiting from external services, such as an LLM provider.
    max_concurrency = 5  

    async def get_score(self, output: str) -> float:
        # Simulate doing work
        await asyncio.sleep(random.random())

        # Simulate a friendliness score, e.g. as determined by an LLM.
        return random.random()

    async def evaluate_test_case(self, test_case: BaseTestCase, output: str) -> Evaluation:
        """
        This can also be an async function. This is useful if you are interacting
        with an external service that requires async calls, such as OpenAI,
        or if the evaluation you are performing could benefit from concurrency.
        """
        score = await self.get_score(output)
        return Evaluation(
            score=score,
            # Evaluations don't need thresholds attached to them.
            # In this case, the evaluation will just consist of the score.
        )
```

<Hint>
  An evaluator can be used across many test suites!
  The recommended approach for this is to create your own abstract class that subclasses `BaseTestEvaluator`
  and implements any shared logic, then subclass that abstract class for each test suite.

  See the [example](https://github.com/autoblocksai/autoblocks-examples/blob/main/Python/testing-sdk/my_project/evaluators/has_substrings.py)
  and the [Python documentation](https://docs.python.org/3/library/abc.html) on abstract classes.
</Hint>

### Create a test suite

We now have all of the pieces necessary to run a test suite.
Below we'll generate some toy test cases in the schema
we defined above, where the input is a random UUID and its
expected substrings are the substrings of the UUID when split
by "-":

```python
import uuid

from autoblocks.testing.v2.run import run_test_suite

def gen_test_cases(n: int) -> list[MyTestCase]:
    test_cases = []
    for _ in range(n):
        random_id = str(uuid.uuid4())
        test_cases.append(
            MyTestCase(
                input=random_id,
                expected_substrings=random_id.split("-"),
            ),
        )
    return test_cases

run_test_suite(
    id="my-test-suite",
    fn=test_fn,
    test_cases=gen_test_cases(400),
    evaluators=[
        HasAllSubstrings(),
        IsFriendly(),
    ],
    # The maximum number of test cases that can be running
    # concurrently through `fn`. Useful to avoid rate limiting
    # from external services, such as an LLM provider.
    max_test_case_concurrency=10,
)
```

### Run the test suite locally

To execute this test suite, first get your **local testing API key** from the [settings page](https://app.autoblocks.ai/settings/api-keys)
and set it as an environment variable:

```bash
export AUTOBLOCKS_API_KEY=...
```

Make sure you've followed our [CLI setup instructions](/cli/setup) and then run the following:

```bash
# Assumes you've saved the above code in a file called run.py
npx autoblocks testing exec -m "my first run" -- python3 run.py
```

The [`autoblocks testing exec`](/cli/commands#testing-exec) command will show the progress of all test suites in your terminal
and also send the results to Autoblocks:

<AutoblocksImage src="/images/testing/quick-start/cli-dots.png" alt="Testing CLI" />

You can view details of the results by clicking on the link displayed in the terminal or by visiting
the [test suites page](https://app.autoblocks.ai/testing/local) in the Autoblocks platform.

## Examples

To see a more complete example, check out our [Python example repository](https://github.com/autoblocksai/autoblocks-examples/tree/main/Python/testing-sdk).


# Python SDK Reference
Source: https://docs.autoblocks.ai/v2/guides/testing/python/sdk-reference

Technical reference for the Autoblocks Python SDK testing functionality.

# Python SDK Reference

## `run_test_suite`

The main entrypoint into the testing framework.

| name                        | required | type                            | description                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| --------------------------- | -------- | ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id`                        | true     | `string`                        | A unique ID for the test suite. This will be displayed in the Autoblocks platform and should remain the same for the lifetime of a test suite.                                                                                                                                                                                                                                                                                                                             |
| `test_cases`                | true     | `list[BaseTestCase]`            | A list of instances that subclass `BaseTestCase`. These are typically [dataclasses](https://docs.python.org/3/library/dataclasses.html) and can be any schema that facilitates testing your application. They will be passed directly to `fn` and will also be made available to your evaluators. `BaseTestCase` is an abstract base class that requires you to implement the `hash` function. See [Test case hashing](#reference-test-case-hashing) for more information. |
| `evaluators`                | true     | `list[BaseTestEvaluator]`       | A list of instances that subclass [`BaseTestEvaluator`](#reference-base-test-evaluator).                                                                                                                                                                                                                                                                                                                                                                                   |
| `fn`                        | true     | `Callable[[BaseTestCase], Any]` | The function you are testing. Its only argument is an instance of a test case. This function can be synchronous or asynchronous and can return any type.                                                                                                                                                                                                                                                                                                                   |
| `max_test_case_concurrency` | false    | `int`                           | The maximum number of test cases that can be running concurrently through `fn`. Useful to avoid rate limiting from external services, such as an LLM provider.                                                                                                                                                                                                                                                                                                             |

## `BaseTestCase`

An abstract base class that you can subclass to create your own test cases.

| name   | required | type                | description                                                                                                                                                         |
| ------ | -------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `hash` | true     | `Callable[[], str]` | A method that returns a string that uniquely identifies the test case for its lifetime. See [Test case hashing](#reference-test-case-hashing) for more information. |

## `BaseTestEvaluator`

An abstract base class that you can subclass to create your own evaluators.

| name                 | required | type                                                  | description                                                                                                                                                              |
| -------------------- | -------- | ----------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `id`                 | true     | `string`                                              | A unique identifier for the evaluator.                                                                                                                                   |
| `max_concurrency`    | false    | `number`                                              | The maximum number of concurrent calls to `evaluate_test_case` allowed for the evaluator. Useful to avoid rate limiting from external services, such as an LLM provider. |
| `evaluate_test_case` | true     | `Callable[[BaseTestCase, Any], Optional[Evaluation]]` | Creates an evaluation on a test case and its output. This method can be synchronous or asynchronous.                                                                     |

## `Evaluation`

A class that represents the result of an evaluation.

| name        | required | type        | description                                                                                                                                                                                                                    |
| ----------- | -------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `score`     | true     | `number`    | A number between 0 and 1 that represents the score of the evaluation.                                                                                                                                                          |
| `threshold` | false    | `Threshold` | An optional [`Threshold`](#reference-threshold) that describes the range the score must be in in order to be considered passing. If no threshold is attached, the score is reported and the pass / fail status is undefined.   |
| `metadata`  | false    | `object`    | Key-value pairs that provide additional context about the evaluation. This is typically used to explain why an evaluation failed. Attached metadata is surfaced in the [test run comparison](/testing/test-run-comparison) UI. |

## `Threshold`

A class that defines the passing criteria for an evaluation.

| name  | required | type     | description                                                                                   |
| ----- | -------- | -------- | --------------------------------------------------------------------------------------------- |
| `lt`  | false    | `number` | The score must be **less than** this number in order to be considered passing.                |
| `lte` | false    | `number` | The score must be **less than or equal** to this number in order to be considered passing.    |
| `gt`  | false    | `number` | The score must be **greater than** this number in order to be considered passing.             |
| `gte` | false    | `number` | The score must be **greater than or equal** to this number in order to be considered passing. |

## Example

<CodeGroup>
  ```python
  from dataclasses import dataclass
  from autoblocks.testing.models import BaseTestCase
  from autoblocks.testing.models import BaseTestEvaluator
  from autoblocks.testing.models import Evaluation
  from autoblocks.testing.models import Threshold
  from autoblocks.testing.run import run_test_suite
  from autoblocks.testing.util import md5

  @dataclass
  class MyTestCase(BaseTestCase):
      input: str
      expected_substrings: list[str]

      def hash(self) -> str:
          return md5(self.input)

  class HasAllSubstrings(BaseTestEvaluator):
      id = "has-all-substrings"

      def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
          score = 1.0
          for substring in test_case.expected_substrings:
              if substring not in output:
                  score = 0.0
                  break

          return Evaluation(
              score=score,
              threshold=Threshold(gte=1.0),
              metadata={
                  "reason": "Output must contain all expected substrings",
              },
          )

  run_test_suite(
      id="my-test-suite",
      test_cases=[
          MyTestCase(
              input="hello world",
              expected_substrings=["hello", "world"],
          )
      ],
      fn=lambda test_case: test_case.input,
      evaluators=[HasAllSubstrings()],
  )
  ```
</CodeGroup>


# TypeScript Quick Start
Source: https://docs.autoblocks.ai/v2/guides/testing/typescript/quick-start

Get started with the Autoblocks Testing SDK for TypeScript.

## Overview

Autoblocks Testing enables you to declaratively define tests for your LLM application and execute them either locally
or in a CI/CD pipeline. Your tests can exist in a standalone script or be executed as part of a larger test framework.

```typescript
runTestSuite<MyTestCase, string>({
  id: 'my-test-suite',
  testCases: genTestCases(),
  testCaseHash: ['input'],
  evaluators: [new HasAllSubstrings(), new IsFriendly()],
  fn: testFn,
});
```

## Getting Started

### Install the SDK

```bash
npm install @autoblocks/client
# or
yarn add @autoblocks/client
# or
pnpm add @autoblocks/client
```

### Define your test case schema

Your test case schema should contain all of the properties
necessary to run your test function and to then make assertions
on the output via your evaluators.
This schema can be anything you want in order to facilitate testing your application.

```typescript
interface MyTestCase {
  input: string;
  expectedSubstrings: string[];
}
```

### Implement a function to test

This function should take an instance of a test case and return an output.
The function can be synchronous or asynchronous and the output can be anything: a string, a number, a complex object, etc.

For this example, we are splitting the test case's `input` property on its hyphens and randomly discarding some of the substrings
to simulate failures on the `has-all-substrings` evaluator.

```typescript
async function testFn({ testCase }: { testCase: MyTestCase }): Promise<string> {
  // Simulate doing work
  await new Promise((resolve) => setTimeout(resolve, Math.random() * 1000));

  const substrings = testCase.input.split('-');
  if (Math.random() < 0.2) {
    // Remove a substring randomly. This will cause about 20% of the test cases to fail
    // the "has-all-substrings" evaluator.
    substrings.pop();
  }

  return substrings.join('-');
}
```

### Create an evaluator

Evaluators allow you to attach an [`Evaluation`](#reference-evaluation) to a test case's output,
where the output is the result of running the test case through the function you are testing.
Your test suite can have multiple evaluators.
The evaluation method that you implement on the evaluator will have access to both the test case
instance and the output of the test function over the given test case. Your evaluation method
can be synchronous or asynchronous, but it must return an instance of `Evaluation`.

The evaluation must have a score between 0 and 1, and you can optionally attach a
[`Threshold`](#reference-threshold) that describes the range the score must be in in order
to be considered passing. If no threshold is attached, the score is reported and the pass / fail
status is undefined. Evaluations can also have metadata attached to them, which can be useful
for providing additional context when an evaluation fails.

For this example we'll define two evaluators:

```typescript
import {
  BaseTestEvaluator,
  type Evaluation,
} from '@autoblocks/client/testing';

/**
 * An evaluator is a class that subclasses BaseTestEvaluator.
 *
 * It must specify an ID, which is a unique identifier for the evaluator.
 *
 * It has two required type parameters:
 * - TestCaseType: The type of your test cases.
 * - OutputType: The type of the output returned by the function you are testing.
 */
class HasAllSubstrings extends BaseTestEvaluator<MyTestCase, string> {
  id = 'has-all-substrings';

  /**
   * Evaluates the output of a test case.
   *
   * Required to be implemented by subclasses of BaseTestEvaluator.
   * This method can be synchronous or asynchronous.
   */
  evaluateTestCase(args: { testCase: MyTestCase; output: string }): Evaluation {
    const missingSubstrings = args.testCase.expectedSubstrings.filter(
      (s) => !args.output.includes(s),
    );
    const score = missingSubstrings.length ? 0 : 1;

    return {
      score,
      threshold: {
        // If the score is not greater than or equal to 1,
        // this evaluation will be marked as a failure.
        gte: 1,
      },
      metadata: {
        // Include the missing substrings as metadata
        // so that we can easily see which strings were
        // missing when viewing a failed evaluation
        // in the Autoblocks UI.
        missingSubstrings,
      },
    };
  }
}

class IsFriendly extends BaseTestEvaluator<MyTestCase, string> {
  id = 'is-friendly';

  // The maximum number of concurrent calls to `evaluateTestCase` allowed for this evaluator.
  // Useful to avoid rate limiting from external services, such as an LLM provider.
  maxConcurrency = 5;

  async getScore(output: string): Promise<number> {
    // Simulate doing work
    await new Promise((resolve) => setTimeout(resolve, Math.random() * 1000));

    // Simulate a friendliness score, e.g. as determined by an LLM.
    return Math.random();
  }

  /**
   * This can also be an async function. This is useful if you are interacting
   * with an external service that requires async calls, such as OpenAI, or if
   * the evaluation you are performing could benefit from concurrency.
   */
  async evaluateTestCase(args: {
    testCase: MyTestCase;
    output: string;
  }): Promise<Evaluation> {
    const score = await this.getScore(args.output);

    return {
      score,
      // Evaluations don't need thresholds attached to them.
      // In this case, the evaluation will just consist of the score.
    };
  }
}
```

<Hint>
  An evaluator can be used across many test suites!
  The recommended approach for this is to create your own abstract class that subclasses `BaseTestEvaluator`
  and implements any shared logic, then subclass that abstract class for each test suite.

  See the [example](https://github.com/autoblocksai/autoblocks-examples/blob/main/JavaScript/testing-sdk/src/evaluators/has-substrings.ts)
  and the [TypeScript documentation](https://www.typescriptlang.org/docs/handbook/classes.html#abstract-classes) on abstract classes.
</Hint>

### Create a test suite

We now have all of the pieces necessary to run a test suite.
Below we'll generate some toy test cases in the schema
we defined above, where the input is a random UUID and its
expected substrings are the substrings of the UUID when split
by "-":

```typescript
import crypto from 'crypto';
import { runTestSuite } from '@autoblocks/client/testing';

function genTestCases(n: number): MyTestCase[] {
  const testCases: MyTestCase[] = [];
  for (let i = 0; i < n; i++) {
    const randomId = crypto.randomUUID();
    testCases.push({
      input: randomId,
      expectedSubstrings: randomId.split('-'),
    });
  }
  return testCases;
}

(async () => {
  await runTestSuite<MyTestCase, string>({
    id: 'my-test-suite',
    fn: testFn,
    // Specify here either a list of properties that uniquely identify a test case
    // or a function that takes a test case and returns a hash. See the section on
    // hashing test cases for more information.
    testCaseHash: ['input'],
    testCases: genTestCases(400),
    evaluators: [
      new HasAllSubstrings(),
      new IsFriendly(),
    ],
    // The maximum number of test cases that can be running
    // concurrently through `fn`. Useful to avoid rate limiting
    // from external services, such as an LLM provider.
    maxTestCaseConcurrency: 10,
  });
})();
```

### Run the test suite locally

To execute this test suite, first get your **local testing API key** from the [settings page](https://app.autoblocks.ai/settings/api-keys)
and set it as an environment variable:

```bash
export AUTOBLOCKS_API_KEY=...
```

Make sure you've followed our [CLI setup instructions](/cli/setup) and then run the following:

```bash
# Assumes you've saved the above code in a file called run.ts
npx autoblocks testing exec -m "my first run" -- npx tsx run.ts
```

The [`autoblocks testing exec`](/cli/commands#testing-exec) command will show the progress of all test suites in your terminal
and also send the results to Autoblocks:

<AutoblocksImage src="/images/testing/quick-start/cli-dots.png" alt="Testing CLI" />

You can view details of the results by clicking on the link displayed in the terminal or by visiting
the [test suites page](https://app.autoblocks.ai/testing/local) in the Autoblocks platform.

## Examples

To see a more complete example, check out our [TypeScript example repository](https://github.com/autoblocksai/autoblocks-examples/tree/main/JavaScript/testing-sdk).


# TypeScript SDK Reference
Source: https://docs.autoblocks.ai/v2/guides/testing/typescript/sdk-reference

Technical reference for the Autoblocks TypeScript SDK testing functionality.

# TypeScript SDK Reference

## `runTestSuite`

The main entrypoint into the testing framework.

| name                     | required | type                                              | description                                                                                                                                                                                                                                                                                                                                                                            |
| ------------------------ | -------- | ------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id`                     | true     | `string`                                          | A unique ID for the test suite. This will be displayed in the Autoblocks platform and should remain the same for the lifetime of a test suite.                                                                                                                                                                                                                                         |
| `testCases`              | true     | `BaseTestCase[]`                                  | A list of instances that subclass `BaseTestCase`. These can be any schema that facilitates testing your application. They will be passed directly to `fn` and will also be made available to your evaluators. `BaseTestCase` is an abstract base class that requires you to implement the `hash` function. See [Test case hashing](#reference-test-case-hashing) for more information. |
| `testCaseHash`           | false    | `(testCase: BaseTestCase) => string`              | An optional function that returns a string that uniquely identifies a test case for its lifetime. If not provided, the test case's `hash` method will be used.                                                                                                                                                                                                                         |
| `evaluators`             | true     | `BaseTestEvaluator[]`                             | A list of instances that subclass [`BaseTestEvaluator`](#reference-base-test-evaluator).                                                                                                                                                                                                                                                                                               |
| `fn`                     | true     | `(testCase: BaseTestCase) => Promise<any> or any` | The function you are testing. Its only argument is an instance of a test case. This function can be synchronous or asynchronous and can return any type.                                                                                                                                                                                                                               |
| `maxTestCaseConcurrency` | false    | `number`                                          | The maximum number of test cases that can be running concurrently through `fn`. Useful to avoid rate limiting from external services, such as an LLM provider.                                                                                                                                                                                                                         |

## `BaseTestEvaluator`

An abstract base class that you can subclass to create your own evaluators.

| name               | required | type                                                                                                   | description                                                                                                                                                            |
| ------------------ | -------- | ------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id`               | true     | `string`                                                                                               | A unique identifier for the evaluator.                                                                                                                                 |
| `maxConcurrency`   | false    | `number`                                                                                               | The maximum number of concurrent calls to `evaluateTestCase` allowed for the evaluator. Useful to avoid rate limiting from external services, such as an LLM provider. |
| `evaluateTestCase` | true     | `(testCase: BaseTestCase, output: any) => Promise<Evaluation or undefined> or Evaluation or undefined` | Creates an evaluation on a test case and its output. This method can be synchronous or asynchronous.                                                                   |

## `Evaluation`

An interface that represents the result of an evaluation.

| name        | required | type                  | description                                                                                                                                                                                                                    |
| ----------- | -------- | --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `score`     | true     | `number`              | A number between 0 and 1 that represents the score of the evaluation.                                                                                                                                                          |
| `threshold` | false    | `Threshold`           | An optional [`Threshold`](#reference-threshold) that describes the range the score must be in in order to be considered passing. If no threshold is attached, the score is reported and the pass / fail status is undefined.   |
| `metadata`  | false    | `Record<string, any>` | Key-value pairs that provide additional context about the evaluation. This is typically used to explain why an evaluation failed. Attached metadata is surfaced in the [test run comparison](/testing/test-run-comparison) UI. |

## `Threshold`

An interface that defines the passing criteria for an evaluation.

| name  | required | type     | description                                                                                   |
| ----- | -------- | -------- | --------------------------------------------------------------------------------------------- |
| `lt`  | false    | `number` | The score must be **less than** this number in order to be considered passing.                |
| `lte` | false    | `number` | The score must be **less than or equal** to this number in order to be considered passing.    |
| `gt`  | false    | `number` | The score must be **greater than** this number in order to be considered passing.             |
| `gte` | false    | `number` | The score must be **greater than or equal** to this number in order to be considered passing. |

## Example

<CodeGroup>
  ```typescript
  import { BaseTestCase, BaseTestEvaluator, Evaluation, Threshold, runTestSuite } from '@autoblocks/testing';

  interface MyTestCase extends BaseTestCase {
    input: string;
    expectedSubstrings: string[];
  }

  class HasAllSubstrings extends BaseTestEvaluator {
    id = 'has-all-substrings';

    async evaluateTestCase(testCase: MyTestCase, output: string): Promise<Evaluation> {
      let score = 1.0;
      for (const substring of testCase.expectedSubstrings) {
        if (!output.includes(substring)) {
          score = 0.0;
          break;
        }
      }

      return {
        score,
        threshold: { gte: 1.0 },
        metadata: {
          reason: 'Output must contain all expected substrings',
        },
      };
    }
  }

  await runTestSuite({
    id: 'my-test-suite',
    testCases: [
      {
        input: 'hello world',
        expectedSubstrings: ['hello', 'world'],
        hash: () => 'test-case-1',
      },
    ],
    fn: (testCase: MyTestCase) => testCase.input,
    evaluators: [new HasAllSubstrings()],
  });
  ```
</CodeGroup>


# Overview
Source: https://docs.autoblocks.ai/v2/guides/tracing/overview

Learn about Autoblocks Tracing and how it helps you monitor and debug your AI applications.

# Tracing Overview

Autoblocks Tracing provides powerful observability for your AI applications, helping you understand and debug complex AI workflows. It integrates with OpenTelemetry to provide detailed insights into your application's behavior.

## Key Features

### Comprehensive Tracing

* End-to-end request tracking
* Nested span support for complex operations
* Detailed timing and performance metrics
* Error tracking and debugging

### AI-Specific Instrumentation

* Automatic tracing of LLM calls
* Model performance monitoring
* Token usage tracking
* Response quality metrics

### Integration Flexibility

* TypeScript and Python SDK support
* OpenTelemetry compatibility
* Custom span creation
* Attribute management

### Debugging Tools

* Visual trace exploration
* Error analysis
* Performance optimization
* Usage patterns

## Core Concepts

### Spans

Spans represent individual operations in your application:

* Operation timing and duration
* Context and attributes
* Error tracking
* Parent-child relationships

### Traces

Traces show the complete flow of a request:

* End-to-end request tracking
* Nested operation visualization
* Performance analysis
* Error correlation

### Attributes

Add context to your traces:

* Custom metadata
* Performance metrics
* Error details
* Operation parameters

### Error Handling

Track and debug issues:

* Exception recording
* Error status tracking
* Stack trace capture
* Error correlation

## Next Steps

* [TypeScript Quick Start](/v2/guides/tracing/typescript/quick-start)
* [Python Quick Start](/v2/guides/tracing/python/quick-start)


# Quickstart
Source: https://docs.autoblocks.ai/v2/guides/tracing/python/quick-start

Learn how to use Autoblocks tracing in Python applications

## Installation

```bash
poetry add autoblocksai
# or
pip install autoblocksai
```

## Basic Example

Here's a simple example showing how to use tracing with OpenAI:

```python
import logging
from autoblocks.tracer import init_auto_tracer
from autoblocks.tracer import trace_app
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from openai import AsyncOpenAI

# Configure logging
logging.basicConfig(level=logging.DEBUG)

# Initialize tracing
init_auto_tracer()
OpenAIInstrumentor().instrument()

# Initialize OpenAI client
openai = AsyncOpenAI()

@trace_app("my-app", "production")
async def generate_response(prompt: str):
    response = await openai.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": prompt
        }]
    )
    return response.choices[0].message.content

# Run the application
import asyncio
asyncio.run(generate_response("Hello, how are you?"))
```

## Advanced Usage

### Creating Spans

You can create custom spans using OpenTelemetry's tracer:

```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("my-tracer")

async def my_function():
    with tracer.start_as_current_span("my_operation", kind=SpanKind.INTERNAL) as span:
        try:
            # Your code here
            span.set_attribute("result", "success")
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise
```

### Error Handling

Record exceptions in spans:

```python
try:
    # Your code here
except Exception as e:
    span.record_exception(e)
    span.set_status(trace.Status(trace.StatusCode.ERROR))
    raise
```

### Nested Spans

Create nested spans to represent complex operations:

```python
async def complex_operation():
    with tracer.start_as_current_span("parent_operation") as parent_span:
        parent_span.set_attribute("parent_data", "value")
        
        with tracer.start_as_current_span("child_operation") as child_span:
            child_span.set_attribute("child_data", "value")
            # Child operation code
```

## Best Practices

1. **Span Naming**
   * Use descriptive names that reflect the operation
   * Follow a consistent naming convention
   * Include the operation type in the name

2. **Attribute Management**
   * Add relevant context to spans
   * Avoid logging sensitive data
   * Use consistent attribute names

3. **Error Handling**
   * Always record exceptions
   * Set appropriate error status
   * Include error details in attributes

4. **Performance**
   * Keep spans focused
   * Avoid excessive attributes
   * Use sampling for high-volume applications

5. **Async Operations**
   * Properly handle async/await with spans
   * Use context managers for span management
   * Use asyncio.gather for parallel operations

6. **Python Integration**
   * Use context managers for span management
   * Leverage Python's async/await patterns
   * Document complex operations

## Additional Resources

### OpenTelemetry

* [OpenTelemetry Python Documentation](https://opentelemetry.io/docs/instrumentation/python/)
* [OpenTelemetry Python API](https://opentelemetry.io/docs/instrumentation/python/api/)

### OpenLLMetry

* [OpenLLMetry Python](https://github.com/traceloop/openllmetry)
* [OpenLLMetry Python Documentation](https://traceloop.com/openllmetry)

## Next Steps

* [TypeScript Tracing Guide](/v2/guides/tracing/typescript/quick-start)
* [Tracing Overview](/v2/guides/tracing/overview)


# Quickstart
Source: https://docs.autoblocks.ai/v2/guides/tracing/typescript/quick-start

Learn how to use Autoblocks tracing in JavaScript/TypeScript applications

## Installation

```bash
npm install @autoblocks/client
```

## Basic Example

Here's a simple example showing how to use tracing with OpenAI:

```typescript
import { traceApp, init } from '@autoblocks/client/tracer';
import OpenAI from 'openai';

// Initialize tracing
init();

// Initialize OpenAI client
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Create a traced application
const generateResponse = async (prompt: string) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }],
  });
  return response.choices[0].message.content;
};

// Run the application
await traceApp(
  'my-app',
  'production',
  generateResponse,
  this,
  {
    prompt: 'Hello, how are you?',
  }
);
```

## Advanced Usage

### Creating Spans

You can create custom spans using OpenTelemetry's tracer:

```typescript
import { trace } from '@opentelemetry/api';
import { SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('my-tracer');

async function myFunction() {
  const span = tracer.startSpan('my_operation', {
    kind: SpanKind.INTERNAL,
    attributes: {
      custom_attribute: 'value'
    }
  });

  try {
    // Your code here
    span.setAttribute('result', 'success');
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
```

### Error Handling

Record exceptions in spans:

```typescript
try {
  // Your code here
} catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw error;
}
```

### Adding Attributes

Add attributes to spans for better context:

```typescript
span.setAttribute('model', 'gpt-4');
span.setAttribute('temperature', 0.7);
span.setAttribute('max_tokens', 100);
```

### Nested Spans

Create nested spans to represent complex operations:

```typescript
async function complexOperation() {
  const parentSpan = tracer.startSpan('parent_operation');
  parentSpan.setAttribute('parent_data', 'value');

  try {
    const childSpan = tracer.startSpan('child_operation', {
      parent: parentSpan
    });
    childSpan.setAttribute('child_data', 'value');

    try {
      // Child operation code
    } finally {
      childSpan.end();
    }
  } finally {
    parentSpan.end();
  }
}
```

## Best Practices

1. **Span Naming**
   * Use descriptive names that reflect the operation
   * Follow a consistent naming convention
   * Include the operation type in the name

2. **Attribute Management**
   * Add relevant context to spans
   * Avoid logging sensitive data
   * Use consistent attribute names

3. **Error Handling**
   * Always record exceptions
   * Set appropriate error status
   * Include error details in attributes

4. **Performance**
   * Keep spans focused
   * Avoid excessive attributes
   * Use sampling for high-volume applications

5. **Async Operations**
   * Properly handle async/await with spans
   * Ensure spans are ended in finally blocks
   * Use Promise.all for parallel operations

6. **TypeScript Integration**
   * Use proper types for spans and attributes
   * Leverage TypeScript's type system for safety
   * Document complex types and interfaces

## Additional Resources

### OpenTelemetry

* [OpenTelemetry JavaScript Documentation](https://opentelemetry.io/docs/instrumentation/js/)
* [OpenTelemetry JavaScript API](https://opentelemetry.io/docs/instrumentation/js/api/)

### OpenLLMetry

* [OpenLLMetry JavaScript](https://github.com/traceloop/openllmetry-js)
* [OpenLLMetry JavaScript Documentation](https://traceloop.com/openllmetry)

## Next Steps

* [Python Tracing Guide](/tracing/python)
* [Tracing Overview](/tracing/overview)


# Overview
Source: https://docs.autoblocks.ai/v2/guides/workflow-builder/overview

Learn about Autoblocks Workflow Builder and how it helps you create and manage complex LLM chains.

# Workflow Builder Overview

Autoblocks Workflow Builder provides a powerful platform for creating and managing chains of LLM calls. Using our visual no-code builder, you can quickly prototype complex LLM workflows and then choose how to run them in production.

## Key Features

### Visual Workflow Builder

* Drag-and-drop interface for creating LLM chains
* Rapid prototyping and testing
* Real-time validation and debugging
* Version control and management

### Production Execution Options

#### Autoblocks-Hosted Runtime

* Run workflows directly on Autoblocks infrastructure
* Managed scaling and reliability
* Built-in monitoring and analytics
* No additional infrastructure needed

#### SDK-Based Execution

* Pull workflow schemas using our SDKs
* Run workflows in your own infrastructure
* Full control over execution environment
* Custom orchestration and monitoring

### Integration Flexibility

* TypeScript and Python SDK support
* REST API access
* Custom workflow orchestration
* CI/CD pipeline integration

### Testing and Validation

* Real-time workflow testing
* Input/output validation
* Error handling and debugging
* Performance monitoring

## Core Concepts

### Workflow Creation

* Visual builder interface
* LLM chain construction
* Conditional logic
* Error handling

### Workflow Schemas

* Type-safe workflow definitions
* Version control
* Deployment management
* Custom orchestration support

### Production Runtime

* Autoblocks-hosted execution
* SDK-based local execution
* Infrastructure management
* Monitoring and analytics

### Testing and Monitoring

* Real-time testing
* Performance tracking
* Error analysis
* Usage patterns

## Getting Started

1. **Create Your Workflow**
   * Use the visual builder to design your LLM chain
   * Test and validate in real-time
   * Save and version your workflow

2. **Choose Production Runtime**
   * **Autoblocks Runtime**: Run directly on our infrastructure
   * **SDK Runtime**: Pull the schema and run in your infrastructure


# llms.txt
Source: https://docs.autoblocks.ai/v2/llms/llms


# llms-full
Source: https://docs.autoblocks.ai/v2/llms/llms-full


# V1 to V2 Migration Guide
Source: https://docs.autoblocks.ai/v2/migrating-to-v2/v1-to-v2-migration-guide

Learn how to migrate from Autoblocks V1 to V2.

## Overview

Autoblocks V2 represents a complete reimagining of the platform, designed to deliver a dramatically improved experience for both technical and non-technical stakeholders. We've streamlined workflows, enhanced performance, and introduced powerful new capabilities that make building, testing, and deploying AI applications faster and more intuitive than ever before.

## Prerequisites

Before you begin, contact [us](mailto:support@autoblocks.ai) to get access to the V2 platform.

You will also need to set the `AUTOBLOCKS_V2_API_KEY` environment variable to your V2 API key.

## Tracing

Autoblocks V2 introduces a completely overhauled tracing system built on top of OpenTelemetry, providing industry-standard observability for your AI applications. This new architecture offers enhanced performance, better integration capabilities, and more comprehensive insights into your AI workflows.

Key improvements in V2 tracing include:

* **OpenTelemetry Integration**: Native support for OpenTelemetry standards
* **Enhanced Performance**: Faster trace collection and processing
* **Better Debugging**: More detailed span information and error tracking
* **AI-Specific Instrumentation**: Automatic tracing of LLM calls with token usage and performance metrics

To get started with the new tracing system, see our comprehensive documentation:

* [Tracing Overview](/v2/guides/tracing/overview)
* [TypeScript Quick Start](/v2/guides/tracing/typescript/quick-start)
* [Python Quick Start](/v2/guides/tracing/python/quick-start)

## Testing

Testing in Autoblocks V2 remains largely unchanged from V1, with only minor updates required for migration. The core testing framework and API stay the same, making this one of the smoothest transitions in your migration process.

**What's Changed:**

* **Import Updates**: Update your imports to use the V2 client
* **Tighter Tracing Integration**: Testing is now more closely integrated with the tracing system for better observability

**Migration Requirements:**

1. **Update Imports**: Change your import statements to use the V2 client package
2. **Initialize Tracer**: You'll need to initialize the tracer to take advantage of the improved testing and tracing integration

**What Stays the Same:**

* Test case structure and format
* Evaluator API and functionality
* Test suite configuration
* CI/CD integration patterns

For detailed setup instructions and examples, see our testing documentation:

* [Testing Overview](/v2/guides/testing/overview)
* [TypeScript Quick Start](/v2/guides/testing/typescript/quick-start)
* [Python Quick Start](/v2/guides/testing/python/quick-start)
* [CI/CD Integration](/v2/guides/testing/ci)

## Prompt Management

Prompt Management in Autoblocks V2 maintains the same core functionality you know from V1, but is now organized around an app-based structure for better organization and scalability. The API and workflow remain familiar, making migration straightforward.

**What's New:**

* **App-Based Organization**: Prompts are now structured and organized by application
* **Enhanced Type Safety**: Improved autocomplete and type checking across TypeScript and Python
* **Better Scalability**: Cleaner organization for managing prompts across multiple projects

**What Stays the Same:**

* Core prompt management API and functionality
* Version control and deployment patterns
* Template rendering and parameter handling
* CI/CD integration capabilities

The migration process is smooth since the underlying prompt management concepts remain unchanged - you'll primarily need to adapt to the new app-based organization structure.

For detailed setup and migration instructions, see our prompt management documentation:

* [Prompt Management Overview](/v2/guides/prompt-management/overview)
* [TypeScript Quick Start](/v2/guides/prompt-management/typescript/quick-start)
* [Python Quick Start](/v2/guides/prompt-management/python/quick-start)

## Datasets

Datasets in Autoblocks V2 provide flexible test case and data management with enhanced schema versioning and organization capabilities. The system supports both programmatic and web-based management approaches.

For comprehensive information on working with datasets, including schema management, dataset splits, and integration options, please refer to our API reference documentation:

* [Datasets Overview](/v2/guides/datasets/overview)
* [API Reference](/api-reference/datasets/list-datasets)

## Human Review

Human Review in Autoblocks V2 maintains the familiar experience you know from V1, while introducing powerful new capabilities for better customization and collaboration.

**What Stays the Same:**

* Core human review workflow and interface
* Review job creation and management
* Evaluation and scoring processes

**What's New and Improved:**

* **Configurable UI Fields**: Customize which fields are displayed in the review interface for a cleaner, more focused experience
* **Multiple Rubrics per App**: Create and manage multiple evaluation rubrics within a single application for different review scenarios
* **Multi-Person Assignment**: Assign review jobs to multiple reviewers simultaneously for collaborative evaluation and consensus building

These enhancements make human review more flexible and scalable while preserving the intuitive workflow that teams already know and rely on.

* [Human Review Overview](/v2/guides/human-review/overview)