Overview

Autoblocks Testing enables you to declaratively define tests for your LLM application and execute them either locally or in a CI/CD pipeline. Your tests can exist in a standalone script or be executed as part of a larger test framework.

runTestSuite<MyTestCase, string>({
  id: 'my-test-suite',
  testCases: genTestCases(400),
  testCaseHash: ['input'],
  evaluators: [new HasAllSubstrings(), new IsFriendly()],
  fn: testFn,
});

Getting Started

Install the SDK

npm install @autoblocks/client
# or
yarn add @autoblocks/client
# or
pnpm add @autoblocks/client

Define your test case schema

Your test case schema should contain all of the properties necessary to run your test function and then make assertions on the output via your evaluators. The schema can be whatever best facilitates testing your application.

interface MyTestCase {
  input: string;
  expectedSubstrings: string[];
}

Implement a function to test

This function should take an instance of a test case and return an output. The function can be synchronous or asynchronous and the output can be anything: a string, a number, a complex object, etc.

For this example, we split the test case’s input property on its hyphens and, about 20% of the time, drop the last substring to simulate failures on the has-all-substrings evaluator.

async function testFn({ testCase }: { testCase: MyTestCase }): Promise<string> {
  // Simulate doing work
  await new Promise((resolve) => setTimeout(resolve, Math.random() * 1000));

  const substrings = testCase.input.split('-');
  if (Math.random() < 0.2) {
    // Remove a substring randomly. This will cause about 20% of the test cases to fail
    // the "has-all-substrings" evaluator.
    substrings.pop();
  }

  return substrings.join('-');
}

Create an evaluator

Evaluators attach an Evaluation to a test case’s output, where the output is the result of running the test case through the function you are testing. A test suite can have multiple evaluators. The evaluation method you implement on an evaluator has access to both the test case and the output your test function produced for it. The method can be synchronous or asynchronous, but it must return an Evaluation.

The evaluation must have a score between 0 and 1, and you can optionally attach a Threshold describing the range the score must fall within to be considered passing. If no threshold is attached, the score is reported and the pass/fail status is undefined. Evaluations can also have metadata attached, which is useful for providing additional context when an evaluation fails.
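
As a minimal sketch, an evaluation that passes only when the score lands between 0.5 and 1 could look like the object below. The gte bound is used by the evaluators later in this guide; the lte bound is an assumption about the Threshold type and may differ in your SDK version.

const evaluation: Evaluation = {
  // Scores must be between 0 and 1.
  score: 0.8,
  threshold: {
    // Passing requires 0.5 <= score <= 1.
    gte: 0.5,
    lte: 1, // assumed upper-bound field on Threshold
  },
  metadata: {
    // Arbitrary context shown alongside the evaluation in the Autoblocks UI.
    reason: 'score derived from substring overlap',
  },
};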

For this example we’ll define two evaluators:

import {
  BaseTestEvaluator,
  type Evaluation,
} from '@autoblocks/client/testing';

/**
 * An evaluator is a class that subclasses BaseTestEvaluator.
 *
 * It must specify an ID, which is a unique identifier for the evaluator.
 *
 * It has two required type parameters:
 * - TestCaseType: The type of your test cases.
 * - OutputType: The type of the output returned by the function you are testing.
 */
class HasAllSubstrings extends BaseTestEvaluator<MyTestCase, string> {
  id = 'has-all-substrings';

  /**
   * Evaluates the output of a test case.
   *
   * Required to be implemented by subclasses of BaseTestEvaluator.
   * This method can be synchronous or asynchronous.
   */
  evaluateTestCase(args: { testCase: MyTestCase; output: string }): Evaluation {
    const missingSubstrings = args.testCase.expectedSubstrings.filter(
      (s) => !args.output.includes(s),
    );
    const score = missingSubstrings.length ? 0 : 1;

    return {
      score,
      threshold: {
        // If the score is not greater than or equal to 1,
        // this evaluation will be marked as a failure.
        gte: 1,
      },
      metadata: {
        // Include the missing substrings as metadata
        // so that we can easily see which strings were
        // missing when viewing a failed evaluation
        // in the Autoblocks UI.
        missingSubstrings,
      },
    };
  }
}

class IsFriendly extends BaseTestEvaluator<MyTestCase, string> {
  id = 'is-friendly';

  // The maximum number of concurrent calls to `evaluateTestCase` allowed for this evaluator.
  // Useful to avoid rate limiting from external services, such as an LLM provider.
  maxConcurrency = 5;

  async getScore(output: string): Promise<number> {
    // Simulate doing work
    await new Promise((resolve) => setTimeout(resolve, Math.random() * 1000));

    // Simulate a friendliness score, e.g. as determined by an LLM.
    return Math.random();
  }

  /**
   * This can also be an async function. This is useful if you are interacting
   * with an external service that requires async calls, such as OpenAI, or if
   * the evaluation you are performing could benefit from concurrency.
   */
  async evaluateTestCase(args: {
    testCase: MyTestCase;
    output: string;
  }): Promise<Evaluation> {
    const score = await this.getScore(args.output);

    return {
      score,
      // Evaluations don't need thresholds attached to them.
      // In this case, the evaluation will just consist of the score.
    };
  }
}
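
In practice, getScore would call out to an LLM rather than return a random number. A minimal sketch of what that could look like, assuming the official openai package and a hypothetical rating prompt (this is not part of the Autoblocks SDK):

import OpenAI from 'openai';

// Reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

async function getFriendlinessScore(output: string): Promise<number> {
  // Ask the model to rate friendliness from 0 to 10, then normalize to [0, 1].
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content:
          'Rate the friendliness of the following text from 0 to 10. Respond with only the number.',
      },
      { role: 'user', content: output },
    ],
  });

  const raw = Number(response.choices[0].message.content);
  // Clamp to [0, 1] and fall back to 0 if the model response is not a number.
  return Number.isNaN(raw) ? 0 : Math.min(Math.max(raw / 10, 0), 1);
}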

Create a test suite

We now have all of the pieces necessary to run a test suite. Below we’ll generate some toy test cases in the schema we defined above, where the input is a random UUID and its expected substrings are the substrings of the UUID when split by '-':

import crypto from 'crypto';
import { runTestSuite } from '@autoblocks/client/testing';

function genTestCases(n: number): MyTestCase[] {
  const testCases: MyTestCase[] = [];
  for (let i = 0; i < n; i++) {
    const randomId = crypto.randomUUID();
    testCases.push({
      input: randomId,
      expectedSubstrings: randomId.split('-'),
    });
  }
  return testCases;
}

(async () => {
  await runTestSuite<MyTestCase, string>({
    id: 'my-test-suite',
    fn: testFn,
    // Specify here either a list of properties that uniquely identify a test case
    // or a function that takes a test case and returns a hash. See the section on
    // hashing test cases for more information.
    testCaseHash: ['input'],
    testCases: genTestCases(400),
    evaluators: [
      new HasAllSubstrings(),
      new IsFriendly(),
    ],
    // The maximum number of test cases that can be running
    // concurrently through `fn`. Useful to avoid rate limiting
    // from external services, such as an LLM provider.
    maxTestCaseConcurrency: 10,
  });
})();
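
If no combination of top-level properties uniquely identifies a test case, testCaseHash can instead be a function, as noted in the comment above. A minimal sketch, assuming the function form receives the test case and returns a string:

import crypto from 'crypto';

// Hash the entire test case so any change to its content yields a new hash.
const hashTestCase = (testCase: MyTestCase): string =>
  crypto.createHash('md5').update(JSON.stringify(testCase)).digest('hex');

// Then pass it to runTestSuite instead of a list of property names:
// testCaseHash: hashTestCase,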

Run the test suite locally

To execute this test suite, first get your local testing API key from the settings page and set it as an environment variable:

export AUTOBLOCKS_API_KEY=...

Make sure you’ve followed our CLI setup instructions and then run the following:

# Assumes you've saved the above code in a file called run.ts
npx autoblocks testing exec -m "my first run" -- npx tsx run.ts

The autoblocks testing exec command will show the progress of all test suites in your terminal and also send the results to Autoblocks.

You can view details of the results by clicking on the link displayed in the terminal or by visiting the test suites page in the Autoblocks platform.

Examples

To see a more complete example, check out our TypeScript example repository.