Offline Evaluations

AI-powered applications are non-deterministic and require thorough testing to ensure they are trustworthy and reliable.

  • How do you gain confidence your AI system will perform as expected in the wild?
  • How do you know if a change improves or degrades output quality?
  • How do you build conviction that your product is ready to be deployed to production?

Frequent and rigorous testing and evaluation of your AI product is critical to answering these questions.

Our Testing SDKs empower developers to define and execute tests seamlessly, locally or in a CI/CD pipeline. Tests can be standalone scripts or part of a comprehensive test framework.
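
For instance, a suite can live inside an existing pytest file. This is a minimal sketch under assumed module paths (not an official pattern), reusing the gen_test_cases, test_fn, and HasSubstring names that appear later in this guide:

from autoblocks.testing.run import run_test_suite

from my_project.evaluators import HasSubstring    # hypothetical module path
from my_project.test_cases import gen_test_cases  # hypothetical module path
from my_project.ai import test_fn                 # hypothetical module path

def test_ai_product_quality():
    # pytest collects and runs this like any other test function;
    # run_test_suite executes the test cases against the evaluators.
    run_test_suite(
        id="my-test-suite",
        test_cases=gen_test_cases(),
        evaluators=[HasSubstring()],
        fn=test_fn,
    )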

Additionally, our CLI enables product engineers to rapidly test their products while iterating on any part of their AI system.

Configuring evaluators

To understand how well an AI integration performs, you need a way to measure the quality of its outputs. Evaluations serve as proxies for AI output quality.

Is it responding professionally? Is it saying anything malicious? Is it responding with factual information? These questions can be answered by defining evaluators, the building blocks for checking if your product behaves as it should.

There are various types of evaluators. Common ones are:

  • Rule-based: evaluations for things like formatting, substrings, or character count.
  • LLM judges: LLMs evaluating LLM outputs. Meta, right?

Example rule-based evaluator:

from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold

class HasSubstring(BaseTestEvaluator):
    id = "has-substring"

    def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
        score = 1 if test_case.expected_substring in output else 0
        return Evaluation(
            score=score,
            threshold=Threshold(gte=1),
        )
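
The example above assumes a MyTestCase class with an expected_substring field. The test case shape is up to you; one possible definition, following the BaseTestCase pattern used in the snippets later on this page:

from dataclasses import dataclass

from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.util import md5

@dataclass
class MyTestCase(BaseTestCase):
    input: str
    expected_substring: str

    # A stable hash identifying this test case, as in the later examples.
    def hash(self) -> str:
        return md5(self.input)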

Example LLM judge evaluator:

from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold

class IsProfessionalTone(BaseTestEvaluator):
    id = "is-professional-tone"

    # Since this evaluator makes calls to an external service (openai),
    # restrict how many evaluations can be made concurrently
    # with this evaluator.
    max_concurrency = 2

    prompt = """Please evaluate the provided text for its professionalism in the context of formal communication.
Consider the following criteria in your assessment:

Tone and Style: Respectful, objective, and appropriately formal tone without bias or excessive emotionality.
Grammar and Punctuation: Correct grammar, punctuation, and capitalization.
Based on these criteria, provide a binary response where:

0 indicates the text does not maintain a professional tone.
1 indicates the text maintains a professional tone.
No further explanation or summary is required; just provide the number that represents your assessment.
"""

    async def _score_content(self, content: str) -> int:
        # Implementation omitted: sends self.prompt and the content to OpenAI
        # and returns the model's 0 or 1 score.
        ...

    async def evaluate_test_case(
        self,
        test_case: MyTestCase,
        output: str,
    ) -> Evaluation:
        score = await self._score_content(output)
        return Evaluation(score=score)
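
The implementation of _score_content is omitted above; here is one way such a helper might look, assuming the official openai Python SDK (openai>=1.0) and an OPENAI_API_KEY environment variable. This is an illustrative sketch, not the implementation behind the example:

import os

from openai import AsyncOpenAI

openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def score_content(prompt: str, content: str) -> int:
    # Ask the model to grade the content against the rubric in `prompt`;
    # the prompt instructs it to reply with a bare 0 or 1.
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": content},
        ],
    )
    answer = (response.choices[0].message.content or "").strip()
    return 1 if answer.startswith("1") else 0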

Set up your test suite

After defining your evaluators, use them in a test suite to validate the behavior of your AI product. The Autoblocks Testing SDK can be used to create a test suite that integrates with Autoblocks.

from autoblocks.testing.run import run_test_suite
# import your evaluators, test cases, and test function

run_test_suite(
  id="my-test-suite",
  test_cases=gen_test_cases(),
  evaluators=[IsProfessionalTone(), HasSubstring()],
  fn=test_fn,
)

View example test suites on GitHub.

Executing test suites can be done through our CLI:

# Assumes you've saved the above code in a file called run.py
npx autoblocks testing exec -m "my first run" -- python3 run.py

Out-of-the-box evaluators

Autoblocks provides a set of evaluators that can be used out of the box. These evaluators are designed to be easily integrated into your test suite and can help you get started with testing your AI-powered applications.

Each evaluator below lists the custom properties and methods that need to be implemented to use the evaluator in your test suite. You must set the id property, which is a unique identifier for the evaluator. For more details on additional properties that can be set on evaluators, such as max_concurrency, see the BaseTestEvaluator documentation.
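
For example, here is a minimal sketch of configuring one of the out-of-box evaluators (the max_concurrency value is illustrative; the full, runnable examples follow below):

from autoblocks.testing.evaluators import BaseIsValidJSON

class MyIsValidJSON(BaseIsValidJSON):
    # Required: a unique identifier for this evaluator.
    id = "is-valid-json"
    # Optional: limit how many evaluations run concurrently (see BaseTestEvaluator).
    max_concurrency = 5

    def output_mapper(self, output: str) -> str:
        return output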

All of the code snippets can be run by following the instructions in the Quick Start guide.

The evaluators fall into three groups: Logic Based, LLM Judges, and Ragas.

Is Equals

The IsEquals evaluator checks if the expected output equals the actual output.

Scores 1 if equal, 0 otherwise.

  • test_case_mapper (Callable[[BaseTestCase], str], required): Map your test case to a string for comparison.
  • output_mapper (Callable[[OutputType], str], required): Map your output to a string for comparison.

from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseIsEquals
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)

class IsEquals(BaseIsEquals[TestCase, str]):
    id = "is-equals"

    def test_case_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output

    def output_mapper(self, output: str) -> str:
        return output

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hello world",
        ),
        TestCase(
            input="hi world",
            expected_output="hello world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsEquals()],
)

Is Valid JSON

The IsValidJSON evaluator checks if the output is valid JSON.

Scores 1 if it is valid, 0 otherwise.

  • output_mapper (Callable[[OutputType], str], required): Map your output to the string that you want to check is valid JSON.

from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseIsValidJSON
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)

class IsValidJSON(BaseIsValidJSON[TestCase, str]):
    id = "is-valid-json"

    def output_mapper(self, output: str) -> str:
        return output

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
        ),
        TestCase(
            input='{"hello": "world"}'
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsValidJSON()],
)

Has All Substrings

The HasAllSubstrings evaluator checks if the output contains all the expected substrings.

Scores 1 if all substrings are present, 0 otherwise.

  • test_case_mapper (Callable[[BaseTestCase], list[str]], required): Map your test case to a list of strings to check for in the output.
  • output_mapper (Callable[[OutputType], str], required): Map your output to a string for comparison.

from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseHasAllSubstrings
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_substrings: list[str]

    def hash(self) -> str:
        return md5(self.input)

class HasAllSubstrings(BaseHasAllSubstrings[TestCase, str]):
    id = "has-all-substrings"

    def test_case_mapper(self, test_case: TestCase) -> list[str]:
        return test_case.expected_substrings

    def output_mapper(self, output: str) -> str:
        return output

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_substrings=["hello", "world"],
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[HasAllSubstrings()],
)

Assertions (Rubric/Rules)

The Assertions evaluator enables you to define a set of assertions or rules that your output must satisfy.

  • evaluate_assertions (Callable[[BaseTestCase, Any], Optional[List[Assertion]]], required): Implement your logic to evaluate the assertions.

from typing import Optional
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseAssertions
from autoblocks.testing.models import Assertion
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCaseCriterion:
    criterion: str
    required: bool


@dataclass
class TestCase(BaseTestCase):
    input: str
    assertions: Optional[list[TestCaseCriterion]] = None

    def hash(self) -> str:
        return md5(self.input)


class AssertionsEvaluator(BaseAssertions[TestCase, str]):
    id = "assertions"

    def evaluate_assertions(self, test_case: TestCase, output: str) -> list[Assertion]:
        if test_case.assertions is None:
            return []
        result = []
        for assertion in test_case.assertions:
            result.append(
                Assertion(
                    criterion=assertion.criterion,
                    passed=assertion.criterion in output,
                    required=assertion.required,
                )
            )
        return result


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            assertions=[
                TestCaseCriterion(criterion="hello", required=True),
                TestCaseCriterion(criterion="world", required=True),
                TestCaseCriterion(criterion="hi", required=False),
            ],
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[AssertionsEvaluator()],
)

LLM Judge

The LLMJudge evaluator makes it easy to implement your own custom evaluator using an LLM as a judge.

  • score_choices (list[ScoreChoice], required): The choices for the LLM judge to use when answering.
  • make_prompt (Callable[[TestCaseType, OutputType, list[EvaluationOverride]], str], required): The prompt passed to the LLM judge. It should be phrased as a question.
  • threshold (Threshold, optional): The threshold for the evaluator.
  • model (str, optional): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to gpt-4o.
  • num_overrides (int, optional): The number of recent evaluation overrides to fetch for the LLM judge and pass to make_prompt. Defaults to 0.

from dataclasses import dataclass
from textwrap import dedent
from autoblocks.testing.evaluators import BaseLLMJudge
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.models import EvaluationOverride
from autoblocks.testing.models import Threshold
from autoblocks.testing.models import ScoreChoice
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)

class IsFriendly(BaseLLMJudge[TestCase, str]):
    id = "is-friendly"
    threshold = Threshold(gte=1)
    score_choices = [
        ScoreChoice(name="Friendly", value=1),
        ScoreChoice(name="Not friendly", value=0),
    ]

    def make_prompt(self, test_case: TestCase, output: str, recent_overrides: list[EvaluationOverride]) -> str:
        return dedent(
            f"""
                Is the output friendly?

                [Output]
                {output}
            """).strip()

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input='I hate you!'
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsFriendly()],
)

Automatic Battle

The AutomaticBattle evaluator compares the output of your test function to a baseline maintained by Autoblocks. On the first run, the baseline will be set to the output of the test function. On subsequent runs, if the evaluator determines that the new output is better, the baseline will be updated to that output. If you would like to provide your own baseline, use the ManualBattle evaluator instead.

Scores 1 if the challenger wins, 0.5 if it's a tie, and 0 if the baseline wins.

  • criteria (str, required): The criteria the LLM should use when comparing the output to the baseline.
  • output_mapper (Callable[[OutputType], str], required): Map your output to a string for comparison.

from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseAutomaticBattle
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)

class Battle(BaseAutomaticBattle[TestCase, str]):
    id = "battle"
    criteria = "Choose the best greeting." # Replace with your own criteria

    def output_mapper(self, output: str) -> str:
        return output

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Battle()],
)

Manual Battle

The ManualBattle evaluator compares an output to a baseline based on the criteria you provide. If you would like Autoblocks to automatically manage the baseline, use the AutomaticBattle evaluator instead.

Scores 1 if the challenger wins, 0.5 if it's a tie, and 0 if the baseline wins.

  • criteria (str, required): The criteria the LLM should use when comparing the output to the baseline.
  • output_mapper (Callable[[OutputType], str], required): Map your output to a string for comparison.
  • baseline_mapper (Callable[[TestCaseType], str], required): Map the baseline ground truth from your test case for comparison.

from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseManualBattle
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)

class Battle(BaseManualBattle[TestCase, str]):
    id = "battle"
    criteria = "Choose the best greeting." # Replace with your own criteria

    def output_mapper(self, output: str) -> str:
        return output

    def baseline_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hi world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Battle()],
)

Accuracy

The Accuracy evaluator checks if the output is accurate compared to an expected output.

Scores 1 if the output is accurate, 0.5 if somewhat accurate, 0 otherwise.

  • output_mapper (Callable[[OutputType], str], required): Map the output to a string to pass to the LLM judge.
  • expected_output_mapper (Callable[[TestCaseType], str], required): Map the test case to the expected output string.
  • model (str, optional): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to gpt-4o.
  • num_overrides (int, optional): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
  • example_output_mapper (Callable[[EvaluationOverride], str], optional): Map an EvaluationOverride to a string representation of the output. This gets passed to the LLM judge as an example. If you set num_overrides to a non-zero number, this method must be implemented.
  • example_expected_output_mapper (Callable[[EvaluationOverride], str], optional): Map an EvaluationOverride to a string representation of the expected output. This gets passed to the LLM judge as an example. If you set num_overrides to a non-zero number, this method must be implemented.

from dataclasses import dataclass
from autoblocks.testing.evaluators import BaseAccuracy
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)

class Accuracy(BaseAccuracy[TestCase, str]):
    id = "accuracy"

    def output_mapper(self, output: str) -> str:
        return output

    def expected_output_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hello world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Accuracy()],
)

NSFW

The NSFW evaluator checks if the output is considered safe for work.

Scores 1 if the output is safe for work, 0 otherwise.

  • output_mapper (Callable[[OutputType], str], required): Map the output to a string to pass to the LLM judge.
  • model (str, optional): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to gpt-4o.
  • num_overrides (int, optional): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
  • example_output_mapper (Callable[[EvaluationOverride], str], optional): Map an EvaluationOverride to a string representation of the output. This gets passed to the LLM judge as an example. If you set num_overrides to a non-zero number, this method must be implemented.

from dataclasses import dataclass
from autoblocks.testing.evaluators import BaseNSFW
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)

class NSFW(BaseNSFW[TestCase, str]):
    id = "nsfw"

    def output_mapper(self, output: str) -> str:
        return output

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input='I hate you!'
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[NSFW()],
)

Toxicity

The Toxicity evaluator checks if the output contains toxic content.

Scores 1 if the output is not toxic, 0 otherwise.

  • output_mapper (Callable[[OutputType], str], required): Map the output to a string to pass to the LLM judge.
  • model (str, optional): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to gpt-4o.
  • num_overrides (int, optional): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
  • example_output_mapper (Callable[[EvaluationOverride], str], optional): Map an EvaluationOverride to a string representation of the output. This gets passed to the LLM judge as an example. If you set num_overrides to a non-zero number, this method must be implemented.

from dataclasses import dataclass
from autoblocks.testing.evaluators import BaseToxicity
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)

class Toxicity(BaseToxicity[TestCase, str]):
    id = "toxicity"

    def output_mapper(self, output: str) -> str:
        return output

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input='I hate you!'
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Toxicity()],
)

Ragas

Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. We have built wrappers around the metrics to make integration with Autoblocks seamless.

Available Ragas evaluators:

  • BaseRagasLLMContextPrecisionWithReference uses an LLM to measure the proportion of relevant chunks in the retrieved_contexts.
  • BaseRagasNonLLMContextPrecisionWithReference measures the proportion of relevant chunks in the retrieved_contexts without using an LLM.
  • BaseRagasLLMContextRecall evaluates the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
  • BaseRagasNonLLMContextRecall uses non-LLM string comparison metrics to identify whether a retrieved context is relevant.
  • BaseRagasContextEntitiesRecall evaluates the measure of recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in the ground_truths alone.
  • BaseRagasNoiseSensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.
  • BaseRagasResponseRelevancy focuses on assessing how pertinent the generated answer is to the given prompt.
  • BaseRagasFaithfulness measures the factual consistency of the generated answer against the given context.
  • BaseRagasFactualCorrectness compares and evaluates the factual accuracy of the generated response with the reference. This metric is used to determine the extent to which the generated response aligns with the reference.
  • BaseRagasSemanticSimilarity measures the semantic resemblance between the generated answer and the ground truth.

Example using one of the Ragas evaluators:

from dataclasses import dataclass

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper  # type: ignore[import-untyped]
from ragas.llms import LangchainLLMWrapper  # type: ignore[import-untyped]

from autoblocks.testing.evaluators import BaseRagasResponseRelevancy
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.models import Threshold
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5

@dataclass
class TestCase(BaseTestCase):
    question: str
    expected_answer: str

    def hash(self) -> str:
        return md5(self.question)
    
@dataclass
class Output:
    answer: str
    contexts: list[str]

# You can use any of the Ragas evaluators listed here:
# https://docs.autoblocks.ai/testing/offline-evaluations#out-of-box-evaluators-ragas
class ResponseRelevancy(BaseRagasResponseRelevancy[TestCase, Output]):
    id = "response-relevancy"
    threshold = Threshold(gte=1)
    llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
    embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

    def user_input_mapper(self, test_case: TestCase, output: Output) -> str:
        return test_case.question

    def response_mapper(self, output: Output) -> str:
        return output.answer

    def retrieved_contexts_mapper(self, test_case: TestCase, output: Output) -> list[str]:
        return output.contexts

run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            question="How tall is the Eiffel Tower?",
            expected_answer="300 meters"
        )
    ],
    fn=lambda test_case: Output(
        answer="300 meters",
        contexts=["The Eiffel tower stands 300 meters tall."],
    ),
    evaluators=[ResponseRelevancy()],
)