Quick Start

Overview

Autoblocks Testing enables you to declaratively define tests for your LLM application and execute them either locally or in a CI/CD pipeline. Your tests can exist in a standalone script or be executed as part of a larger test framework.

run_test_suite(
  id="my-test-suite",
  test_cases=gen_test_cases(),
  evaluators=[HasAllSubstrings(), IsFriendly()],
  fn=test_fn,
)

Getting Started

Install the SDK

poetry add autoblocksai
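
If you're not using Poetry, the same package can be installed with pip (assuming the same package name on PyPI):

pip install autoblocksai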

Define your test case schema

Your test case schema should contain all of the properties needed to run your test function and to make assertions on its output via your evaluators. The schema can be any shape that makes it easy to test your application.

import dataclasses

from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.util import md5

@dataclasses.dataclass
class MyTestCase(BaseTestCase):
    """
    A test case can be any class that subclasses BaseTestCase.

    This example is a dataclass, but it could also be a pydantic model,
    plain Python class, etc.
    """
    input: str
    expected_substrings: list[str]

    def hash(self) -> str:
        """
        This hash serves as a unique identifier for a test case throughout its lifetime.

        Required to be implemented by subclasses of BaseTestCase.
        """
        return md5(self.input)
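
As an optional sanity check (not something the SDK requires), you can confirm that the hash is deterministic for identical inputs:

# Hypothetical ad-hoc check: two test cases with the same input hash identically,
# since the hash is derived from the input alone via md5.
a = MyTestCase(input="hello-world", expected_substrings=["hello", "world"])
b = MyTestCase(input="hello-world", expected_substrings=["hello", "world"])
assert a.hash() == b.hash()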

Implement a function to test

This function should take an instance of a test case and return an output. The function can be synchronous or asynchronous and the output can be anything: a string, a number, a complex object, etc.

For this example, we split the test case's input on its hyphens and randomly discard some of the substrings to simulate failures on the has-all-substrings evaluator (see Create an evaluator below).

import random
import asyncio

async def test_fn(test_case: MyTestCase) -> str:
    """ This could also be a synchronous function. """
    # Simulate doing work
    await asyncio.sleep(random.random())

    substrings = test_case.input.split("-")
    if random.random() < 0.2:
        # Remove a substring randomly. This will cause about 20% of the test cases to fail
        # the "has-all-substrings" evaluator.
        substrings.pop()

    return "-".join(substrings)

Create an evaluator

Evaluators attach an Evaluation to a test case's output, where the output is the result of running the test case through the function you are testing. A test suite can have multiple evaluators. The evaluation method you implement receives both the test case instance and the output your test function produced for that test case. The method can be synchronous or asynchronous, but it must return an instance of Evaluation.

The evaluation must have a score between 0 and 1, and you can optionally attach a Threshold describing the range the score must fall within to be considered passing. If no threshold is attached, the score is reported and the pass / fail status is undefined. Evaluations can also carry metadata, which is useful for providing additional context when an evaluation fails.
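
To make the shape concrete, here is a minimal sketch of an evaluation that passes when the score is at least 0.5, using the same Evaluation and Threshold classes imported in the example below:

Evaluation(
    score=0.7,
    # Passing means the score is >= 0.5; omit the threshold to report the score alone.
    threshold=Threshold(gte=0.5),
    # Arbitrary extra context, surfaced alongside the result in the Autoblocks UI.
    metadata={"note": "why the output scored 0.7"},
)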

For this example we'll define two evaluators:

import random
import asyncio

from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold

class HasAllSubstrings(BaseTestEvaluator):
    """
    An evaluator is a class that subclasses BaseTestEvaluator.

    It must specify an ID, which is a unique identifier for the evaluator.
    """
    id = "has-all-substrings"

    def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
        """
        Evaluates the output of a test case.

        Required to be implemented by subclasses of BaseTestEvaluator.
        This method can be synchronous or asynchronous.
        """
        missing_substrings = [s for s in test_case.expected_substrings if s not in output]
        score = 0 if missing_substrings else 1
        return Evaluation(
            score=score,
            # If the score is not greater than or equal to 1,
            # this evaluation will be marked as a failure.
            threshold=Threshold(gte=1),
            metadata=dict(
                # Include the missing substrings as metadata
                # so that we can easily see which strings were
                # missing when viewing a failed evaluation
                # in the Autoblocks UI.
                missing_substrings=missing_substrings,
            ),
        )

class IsFriendly(BaseTestEvaluator):
    id = "is-friendly"

    # The maximum number of concurrent calls to `evaluate_test_case` allowed for this evaluator.
    # Useful to avoid rate limiting from external services, such as an LLM provider.
    max_concurrency = 5

    async def get_score(self, output: str) -> float:
        # Simulate doing work
        await asyncio.sleep(random.random())

        # Simulate a friendliness score, e.g. as determined by an LLM.
        return random.random()

    async def evaluate_test_case(self, test_case: BaseTestCase, output: str) -> Evaluation:
        """
        This can also be an async function. This is useful if you are interacting
        with an external service that requires async calls, such as OpenAI,
        or if the evaluation you are performing could benefit from concurrency.
        """
        score = await self.get_score(output)
        return Evaluation(
            score=score,
            # Evaluations don't need thresholds attached to them.
            # In this case, the evaluation will just consist of the score.
        )
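
Because evaluators are ordinary classes, you can also exercise one in isolation. For example, a quick, hypothetical check of HasAllSubstrings against a hand-written output (during a real run, run_test_suite invokes the evaluators for you):

tc = MyTestCase(input="a-b-c", expected_substrings=["a", "b", "c"])
# "c" is missing from the output, so the score is 0 and the evaluation fails its threshold.
print(HasAllSubstrings().evaluate_test_case(tc, output="a-b"))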

Create a test suite

We now have all of the pieces necessary to run a test suite. Below we'll generate some toy test cases in the schema we defined above, where the input is a random UUID and its expected substrings are the substrings of the UUID when split by "-":

import uuid

from autoblocks.testing.run import run_test_suite

def gen_test_cases(n: int) -> list[MyTestCase]:
    test_cases = []
    for _ in range(n):
        random_id = str(uuid.uuid4())
        test_cases.append(
            MyTestCase(
                input=random_id,
                expected_substrings=random_id.split("-"),
            ),
        )
    return test_cases

run_test_suite(
    id="my-test-suite",
    fn=test_fn,
    test_cases=gen_test_cases(400),
    evaluators=[
        HasAllSubstrings(),
        IsFriendly(),
    ],
    # The maximum number of test cases that can be running
    # concurrently through `fn`. Useful to avoid rate limiting
    # from external services, such as an LLM provider.
    max_test_case_concurrency=10,
)

Run the test suite locally

To execute this test suite, first get your local testing API key from the settings page and set it as an environment variable:

export AUTOBLOCKS_API_KEY=...

Make sure you've followed our CLI setup instructions and then run the following:

# Assumes you've saved the above code in a file called run.py
npx autoblocks testing exec -m "my first run" -- python3 run.py

The autoblocks testing exec command will show the progress of all test suites in your terminal and also send the results to Autoblocks.

You can view details of the results by clicking on the link displayed in the terminal or by visiting the test suites page in the Autoblocks platform.

Examples

For a more complete example, see below: