Offline Evaluations

AI-powered applications are non-deterministic and require thorough testing to ensure they are trustworthy and reliable.

  • How do you gain confidence your AI system will perform as expected in the wild?
  • How do you know if a change improves or degrades output quality?
  • How do you build conviction that your product is ready to be deployed to production?

Frequent and rigorous testing and evaluation of your AI product is critical to answering these questions.

Our Testing SDKs empower developers to define and execute tests seamlessly, locally or in a CI/CD pipeline. Tests can be standalone scripts or part of a comprehensive test framework.

Additionally, our CLI enables product engineers to rapidly test their products while iterating on any part of their AI system.

Configuring evaluators

To judge how well an AI integration performs, you need a way to measure the quality of its outputs. Evaluations are proxies for that quality.

Is it responding professionally? Is it saying anything malicious? Is it responding with factual information? These questions can be answered by defining evaluators, the building blocks for checking if your product behaves as it should.

There are various types of evaluators. Common ones are:

  • Rule-based: evaluations for things like formatting, substrings, or character count.
  • LLM judges: LLMs evaluating LLM outputs. Meta, right?
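To make the rule-based category concrete, here is a minimal sketch of three such checks (substring, character count, and formatting) written as plain Python functions. It does not use the Autoblocks SDK; `RuleResult` and the `check_*` helpers are hypothetical names chosen for illustration only.

```python
import json
from dataclasses import dataclass


@dataclass
class RuleResult:
    # Hypothetical result type for this sketch; not the SDK's Evaluation class.
    name: str
    score: int  # 1 = pass, 0 = fail


def check_substring(output: str, expected: str) -> RuleResult:
    # Passes when the expected text appears anywhere in the output.
    return RuleResult("has-substring", int(expected in output))


def check_max_length(output: str, max_chars: int) -> RuleResult:
    # Passes when the output is no longer than max_chars characters.
    return RuleResult("max-length", int(len(output) <= max_chars))


def check_is_json(output: str) -> RuleResult:
    # Passes when the output parses as JSON (a simple formatting rule).
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return RuleResult("is-json", 0)
    return RuleResult("is-json", 1)


output = '{"answer": "Paris"}'
results = [
    check_substring(output, "Paris"),
    check_max_length(output, 10),
    check_is_json(output),
]
for r in results:
    print(r.name, r.score)
```

Because checks like these are deterministic and cheap, they can run on every test case without rate limits or API costs.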

Example rule-based evaluator:

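from autoblocks.testing.models import BaseTestEvaluator, Evaluation

class HasSubstring(BaseTestEvaluator):
    id = "has-substring"

    def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
        # Score 1 if the expected substring appears in the output, otherwise 0.
        score = 1 if test_case.expected_substring in output else 0
        return Evaluation(score=score)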

Example LLM judge evaluator:

class IsProfessionalTone(BaseTestEvaluator):
    id = "is-professional-tone"

    # Since this evaluator makes calls to an external service (openai),
    # restrict how many evaluations can be made concurrently
    # with this evaluator.
    max_concurrency = 2

    prompt = """Please evaluate the provided text for its professionalism in the context of formal communication.
Consider the following criteria in your assessment:

Tone and Style: Respectful, objective, and appropriately formal tone without bias or excessive emotionality.
Grammar and Punctuation: Correct grammar, punctuation, and capitalization.
Based on these criteria, provide a binary response where:

0 indicates the text does not maintain a professional tone.
1 indicates the text maintains a professional tone.
No further explanation or summary is required; just provide the number that represents your assessment.
"""

    async def _score_content(self, content: str) -> int:
        # ...details omitted utilizing prompt...
        ...

    async def evaluate_test_case(
        self,
        test_case: MyTestCase,
        output: str,
    ) -> Evaluation:
        score = await self._score_content(output)
        return Evaluation(score=score)
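The scoring details are omitted above. As a rough, SDK-independent sketch of the two ideas involved, the snippet below parses a judge's binary reply and caps concurrent judge calls with an `asyncio.Semaphore`. The names `parse_binary_score`, `score_all`, and `fake_judge` are hypothetical, and `fake_judge` is a stand-in for a real model call; this is not how the SDK implements `max_concurrency`.

```python
import asyncio
from typing import Awaitable, Callable


def parse_binary_score(reply: str) -> int:
    """Map a judge reply to 0 or 1; raise if it is anything else."""
    text = reply.strip()
    if text not in ("0", "1"):
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return int(text)


async def score_all(
    outputs: list[str],
    ask_judge: Callable[[str], Awaitable[str]],
    max_concurrency: int = 2,
) -> list[int]:
    # The semaphore caps how many judge calls are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(text: str) -> int:
        async with semaphore:
            return parse_binary_score(await ask_judge(text))

    # gather preserves input order, so scores line up with outputs.
    return await asyncio.gather(*(score_one(o) for o in outputs))


async def fake_judge(text: str) -> str:
    # Stand-in for a real model call: treats "!!!" as unprofessional.
    await asyncio.sleep(0)
    return "0\n" if "!!!" in text else " 1 "


scores = asyncio.run(
    score_all(["Thank you for your patience.", "NO WAY!!!"], fake_judge)
)
print(scores)
```

Validating the reply strictly, rather than casting whatever the model returns, surfaces cases where the judge ignores the "number only" instruction instead of silently recording a wrong score.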

Set up your test suite

After defining your evaluators, you can use them in a test suite to validate the behavior of your AI product. The Autoblocks Testing SDK lets you create a test suite that integrates with Autoblocks.

  evaluators=[IsProfessionalTone(), HasSubstring()],
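Conceptually, a test suite pairs a set of test cases with a list of evaluators and the function under test. The sketch below shows that relationship with a tiny hand-rolled runner; it is not the Autoblocks SDK, and `Case`, `Check`, `run_suite`, and `echo_app` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    input: str
    expected_substring: str


@dataclass
class Check:
    id: str
    fn: Callable[[Case, str], int]  # returns 1 (pass) or 0 (fail)


def run_suite(cases, checks, app):
    """Run app on each case and return {check id: mean score}."""
    totals = {c.id: 0 for c in checks}
    for case in cases:
        output = app(case.input)
        for c in checks:
            totals[c.id] += c.fn(case, output)
    return {cid: total / len(cases) for cid, total in totals.items()}


def echo_app(prompt: str) -> str:
    # Stand-in for the AI product being tested.
    return f"Hello, {prompt}."


cases = [Case("Ada", "Ada"), Case("Grace", "Hopper")]
checks = [
    Check("has-substring", lambda case, out: int(case.expected_substring in out)),
    Check("ends-with-period", lambda case, out: int(out.endswith("."))),
]
summary = run_suite(cases, checks, echo_app)
print(summary)
```

Reporting a mean score per evaluator makes it easy to compare two runs and see whether a change improved or degraded a specific quality dimension.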

You can run test suites through our CLI:

# Assumes you've saved the above code in a file called
npx autoblocks testing exec -m "my first run" -- python3