Quick Start
Overview
Autoblocks Testing enables you to declaratively define tests for your LLM application and execute them either locally or in a CI/CD pipeline. Your tests can exist in a standalone script or be executed as part of a larger test framework.
run_test_suite(
    id="my-test-suite",
    test_cases=gen_test_cases(),
    evaluators=[HasAllSubstrings(), IsFriendly()],
    fn=test_fn,
)
Getting Started
Install the SDK
poetry add autoblocksai
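If you don't use Poetry, the same package can be installed with pip; this assumes the package published to PyPI uses the same name as above:

pip install autoblocksai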
Define your test case schema
Your test case schema should contain all of the properties necessary to run your test function and then make assertions on the output via your evaluators. The schema can be anything you want, as long as it facilitates testing your application.
import dataclasses

from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.util import md5


@dataclasses.dataclass
class MyTestCase(BaseTestCase):
    """
    A test case can be any class that subclasses BaseTestCase.

    This example is a dataclass, but it could also be a pydantic model,
    plain Python class, etc.
    """
    input: str
    expected_substrings: list[str]

    def hash(self) -> str:
        """
        This hash serves as a unique identifier for a test case throughout its lifetime.

        Required to be implemented by subclasses of BaseTestCase.
        """
        return md5(self.input)
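As a quick, illustrative sanity check (not required by the SDK, and the values here are made up), note that the hash is derived solely from the input, so a test case keeps the same identity across runs:

# Hypothetical example values; any string input works the same way.
case = MyTestCase(input="a-b-c", expected_substrings=["a", "b", "c"])
assert case.hash() == md5("a-b-c")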
Implement a function to test
This function should take an instance of a test case and return an output. The function can be synchronous or asynchronous and the output can be anything: a string, a number, a complex object, etc.
For this example, we are splitting the test case's input property on its hyphens and randomly discarding some of the substrings to simulate failures on the has-all-substrings evaluator (see Create an evaluator below).
import asyncio
import random


async def test_fn(test_case: MyTestCase) -> str:
    """
    This could also be a synchronous function.
    """
    # Simulate doing work
    await asyncio.sleep(random.random())

    substrings = test_case.input.split("-")
    if random.random() < 0.2:
        # Remove a substring randomly. This will cause about 20% of the test cases
        # to fail the "has-all-substrings" evaluator.
        substrings.pop()

    return "-".join(substrings)
Create an evaluator
Evaluators allow you to attach an Evaluation to a test case's output, where the output is the result of running the test case through the function you are testing. Your test suite can have multiple evaluators.

The evaluation method that you implement on the evaluator has access to both the test case instance and the output of the test function for that test case. Your evaluation method can be synchronous or asynchronous, but it must return an instance of Evaluation.

The evaluation must have a score between 0 and 1, and you can optionally attach a Threshold that describes the range the score must fall within to be considered passing. If no threshold is attached, the score is reported and the pass / fail status is undefined. Evaluations can also have metadata attached to them, which can be useful for providing additional context when an evaluation fails.
For this example we'll define two evaluators:
import asyncio
import random

from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold


class HasAllSubstrings(BaseTestEvaluator):
    """
    An evaluator is a class that subclasses BaseTestEvaluator.

    It must specify an ID, which is a unique identifier for the evaluator.
    """
    id = "has-all-substrings"

    def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
        """
        Evaluates the output of a test case.

        Required to be implemented by subclasses of BaseTestEvaluator.
        This method can be synchronous or asynchronous.
        """
        missing_substrings = [s for s in test_case.expected_substrings if s not in output]
        score = 0 if missing_substrings else 1
        return Evaluation(
            score=score,
            # If the score is not greater than or equal to 1,
            # this evaluation will be marked as a failure.
            threshold=Threshold(gte=1),
            metadata=dict(
                # Include the missing substrings as metadata
                # so that we can easily see which strings were
                # missing when viewing a failed evaluation
                # in the Autoblocks UI.
                missing_substrings=missing_substrings,
            ),
        )


class IsFriendly(BaseTestEvaluator):
    id = "is-friendly"

    # The maximum number of concurrent calls to `evaluate_test_case` allowed for this
    # evaluator. Useful to avoid rate limiting from external services, such as an LLM provider.
    max_concurrency = 5

    async def get_score(self, output: str) -> float:
        # Simulate doing work
        await asyncio.sleep(random.random())
        # Simulate a friendliness score, e.g. as determined by an LLM.
        return random.random()

    async def evaluate_test_case(self, test_case: BaseTestCase, output: str) -> Evaluation:
        """
        This can also be an async function. This is useful if you are interacting
        with an external service that requires async calls, such as OpenAI,
        or if the evaluation you are performing could benefit from concurrency.
        """
        score = await self.get_score(output)
        return Evaluation(
            score=score,
            # Evaluations don't need thresholds attached to them.
            # In this case, the evaluation will just consist of the score.
        )
An evaluator can be used across many test suites! The recommended approach is to create your own abstract class that subclasses BaseTestEvaluator and implements any shared logic, then subclass that abstract class for each test suite, as sketched below. Also see the relevant documentation for your language on how to create abstract classes (in Python, the standard abc module).
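Here is a minimal sketch of that pattern, assuming only the BaseTestEvaluator API shown above; the class and method names below are illustrative, not part of the SDK:

import abc

from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold


# Assumes BaseTestEvaluator can be mixed with abc.ABC; if your SDK version
# disallows this, raise NotImplementedError in the base method instead.
class BaseHasAllSubstrings(BaseTestEvaluator, abc.ABC):
    """
    Shared has-all-substrings logic. Each concrete subclass only needs to
    say where the expected substrings live on its test case schema.
    """
    id = "has-all-substrings"

    @abc.abstractmethod
    def expected_substrings(self, test_case) -> list[str]:
        ...

    def evaluate_test_case(self, test_case, output: str) -> Evaluation:
        missing = [s for s in self.expected_substrings(test_case) if s not in output]
        return Evaluation(
            score=0 if missing else 1,
            threshold=Threshold(gte=1),
            metadata=dict(missing_substrings=missing),
        )


class HasAllSubstringsForMySuite(BaseHasAllSubstrings):
    """Concrete evaluator for the test suite in this guide."""

    def expected_substrings(self, test_case: MyTestCase) -> list[str]:
        return test_case.expected_substrings

Each test suite then instantiates its own concrete subclass and passes it to run_test_suite like any other evaluator.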
Create a test suite
We now have all of the pieces necessary to run a test suite. Below we'll generate some toy test cases in the schema we defined above, where the input is a random UUID and its expected substrings are the substrings of the UUID when split by "-":
import uuid

from autoblocks.testing.run import run_test_suite


def gen_test_cases(n: int) -> list[MyTestCase]:
    test_cases = []
    for _ in range(n):
        random_id = str(uuid.uuid4())
        test_cases.append(
            MyTestCase(
                input=random_id,
                expected_substrings=random_id.split("-"),
            ),
        )
    return test_cases


run_test_suite(
    id="my-test-suite",
    fn=test_fn,
    test_cases=gen_test_cases(400),
    evaluators=[
        HasAllSubstrings(),
        IsFriendly(),
    ],
    # The maximum number of test cases that can be running
    # concurrently through `fn`. Useful to avoid rate limiting
    # from external services, such as an LLM provider.
    max_test_case_concurrency=10,
)
Run the test suite locally
To execute this test suite, first get your local testing API key from the settings page and set it as an environment variable:
export AUTOBLOCKS_API_KEY=...
Make sure you've followed our CLI setup instructions and then run the following:
# Assumes you've saved the above code in a file called run.py
npx autoblocks testing exec -m "my first run" -- python3 run.py
The autoblocks testing exec command will show the progress of all test suites in your terminal and also send the results to Autoblocks.
You can view details of the results by clicking on the link displayed in the terminal or by visiting the test suites page in the Autoblocks platform.
Examples
More complete, end-to-end examples are available elsewhere in the Autoblocks documentation.