Offline Evaluations
AI-powered applications are non-deterministic and require thorough testing to ensure they are trustworthy and reliable.
- How do you gain confidence your AI system will perform as expected in the wild?
- How do you know if a change improves or degrades output quality?
- How do you build conviction that your product is ready to be deployed to production?
Frequent and rigorous testing and evaluation of your AI product is critical to answering these questions.
Our Testing SDKs empower developers to define and execute tests seamlessly, locally or in a CI/CD pipeline. Tests can be standalone scripts or part of a comprehensive test framework.
Additionally, our CLI enables product engineers to rapidly test their products while iterating on any part of their AI system.
Configuring evaluators
For further reading, view full documentation here.
When analyzing the performance of an AI integration, you must evaluate its effectiveness. Evaluations are proxies for AI output quality.
Is it responding professionally? Is it saying anything malicious? Is it responding with factual information? These questions can be answered by defining evaluators, the building blocks for checking if your product behaves as it should.
There are various types of evaluators. Common ones are:
- Rule-based: evaluations for things like formatting, substrings, or character count.
- LLM judges: LLMs evaluating LLM outputs. Meta, right?
Example rule-based evaluator:

```python
from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold


class HasSubstring(BaseTestEvaluator):
    id = "has-substring"

    def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
        score = 1 if test_case.expected_substring in output else 0
        return Evaluation(
            score=score,
            threshold=Threshold(gte=1),
        )
```
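The evaluator above references a `MyTestCase` type that is assumed to be defined elsewhere. A minimal sketch of what such a test case might look like (field names are illustrative; in the real suite it would subclass `autoblocks.testing.models.BaseTestCase` and use `autoblocks.testing.util.md5`, shown here with the stdlib only):

```python
import hashlib
from dataclasses import dataclass


@dataclass
class MyTestCase:
    # Illustrative fields; use whatever inputs your product needs.
    input: str
    expected_substring: str

    def hash(self) -> str:
        # A stable hash uniquely identifies this test case across runs.
        return hashlib.md5(self.input.encode()).hexdigest()
```

The `hash` method matters because Autoblocks uses it to correlate results for the same test case across runs.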
Example LLM judge evaluator:

```python
from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold


class IsProfessionalTone(BaseTestEvaluator):
    id = "is-professional-tone"

    # Since this evaluator makes calls to an external service (OpenAI),
    # restrict how many evaluations can be made concurrently
    # with this evaluator.
    max_concurrency = 2

    prompt = """Please evaluate the provided text for its professionalism in the context of formal communication.

Consider the following criteria in your assessment:

Tone and Style: Respectful, objective, and appropriately formal tone without bias or excessive emotionality.
Grammar and Punctuation: Correct grammar, punctuation, and capitalization.

Based on these criteria, provide a binary response where:
0 indicates the text does not maintain a professional tone.
1 indicates the text maintains a professional tone.

No further explanation or summary is required; just provide the number that represents your assessment.
"""

    async def _score_content(self, content: str) -> int:
        ...  # details omitted; sends `prompt` and `content` to the LLM

    async def evaluate_test_case(
        self,
        test_case: MyTestCase,
        output: str,
    ) -> Evaluation:
        score = await self._score_content(output)
        return Evaluation(score=score)
```
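The omitted `_score_content` body would send `prompt` plus the text to an LLM and parse its 0/1 reply. That parsing step is easy to get wrong when the model adds stray whitespace or surrounding words; one defensive way to handle it (a hypothetical helper, not part of the SDK):

```python
import re


def parse_binary_verdict(raw: str) -> int:
    """Extract a 0/1 verdict from an LLM reply, tolerating extra text.

    Raises ValueError if no standalone 0 or 1 is found, so a malformed
    reply fails loudly instead of silently scoring 0.
    """
    match = re.search(r"\b([01])\b", raw.strip())
    if match is None:
        raise ValueError(f"Could not parse a 0/1 verdict from: {raw!r}")
    return int(match.group(1))
```

Failing loudly on an unparseable reply is usually preferable to defaulting to a score, which would quietly skew your results.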
- View example JavaScript offline evaluators on GitHub
- View example Python offline evaluators on GitHub
Set up your test suite
After defining your evaluators, you can use them in a test suite to validate the behavior of your AI product. The Autoblocks Testing SDK can be used to create a test that integrates with Autoblocks.
For further reading, view full documentation here.
```python
from autoblocks.testing.run import run_test_suite

# import your evaluators, test cases, and test function

run_test_suite(
    id="my-test-suite",
    test_cases=gen_test_cases(),
    evaluators=[IsProfessionalTone(), HasSubstring()],
    fn=test_fn,
)
```
View example test suites on GitHub.
Executing test suites can be done through our CLI:
```bash
# Assumes you've saved the above code in a file called run.py
npx autoblocks testing exec -m "my first run" -- python3 run.py
```
Test suites can also be executed in your CI/CD pipeline, allowing you to automate the testing process and trigger test runs from the Autoblocks UI. Learn more about running tests in CI.
Out-of-box evaluators
Autoblocks provides a set of evaluators that can be used out of the box. These evaluators are designed to be easily integrated into your test suite and can help you get started with testing your AI-powered applications.
Each evaluator below lists the custom properties and methods that need to be implemented to use the evaluator in your test suite.
You must set the `id` property, which is a unique identifier for the evaluator. For more details on additional properties that can be set on evaluators, such as `max_concurrency`, see the BaseTestEvaluator documentation.
All of the code snippets can be run by following the instructions in the Quick Start guide.
The out-of-box evaluators fall into three categories:
- Logic Based: Is Equals, Is Valid JSON, Has All Substrings
- LLM Judges: LLM Judge, Automatic Battle, Manual Battle, Accuracy, NSFW, Toxicity
- Ragas: Answer Correctness, Answer Relevancy, Answer Semantic Similarity, Context Entities Recall, Context Precision, Context Recall, Context Relevancy, Faithfulness
Is Equals
The `IsEquals` evaluator checks if the expected output equals the actual output. Scores 1 if equal, 0 otherwise.
- `test_case_mapper` (`Callable[[BaseTestCase], str]`, required): Map your test case to a string for comparison.
- `output_mapper` (`Callable[[OutputType], str]`, required): Map your output to a string for comparison.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseIsEquals
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)


class IsEquals(BaseIsEquals[TestCase, str]):
    id = "is-equals"

    def test_case_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hello world",
        ),
        TestCase(
            input="hi world",
            expected_output="hello world",
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsEquals()],
)
```
Is Valid JSON
The `IsValidJSON` evaluator checks if the output is valid JSON. Scores 1 if valid, 0 otherwise.
- `output_mapper` (`Callable[[OutputType], str]`, required): Map your output to the string that you want to check is valid JSON.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseIsValidJSON
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class IsValidJSON(BaseIsValidJSON[TestCase, str]):
    id = "is-valid-json"

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
        ),
        TestCase(
            input='{"hello": "world"}',
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsValidJSON()],
)
```
Has All Substrings
The `HasAllSubstrings` evaluator checks if the output contains all of the expected substrings. Scores 1 if all substrings are present, 0 otherwise.
- `test_case_mapper` (`Callable[[BaseTestCase], list[str]]`, required): Map your test case to a list of strings to check for in the output.
- `output_mapper` (`Callable[[OutputType], str]`, required): Map your output to a string for comparison.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseHasAllSubstrings
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_substrings: list[str]

    def hash(self) -> str:
        return md5(self.input)


class HasAllSubstrings(BaseHasAllSubstrings[TestCase, str]):
    id = "has-all-substrings"

    def test_case_mapper(self, test_case: TestCase) -> list[str]:
        return test_case.expected_substrings

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_substrings=["hello", "world"],
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[HasAllSubstrings()],
)
```
LLM Judge
The `LLMJudge` evaluator makes it easy to implement your own custom evaluator using an LLM as a judge.
- `score_choices` (`list[ScoreChoice]`, required): The choices for the LLM judge to use when answering.
- `make_prompt` (`Callable[[TestCaseType, OutputType, list[EvaluationOverride]], str]`, required): The prompt passed to the LLM judge. Should be posed as a question.
- `threshold` (`Threshold`, optional): The threshold for the evaluator.
- `model` (`str`, optional): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to `gpt-4-turbo`.
- `num_overrides` (`int`, optional): The number of recent evaluation overrides to fetch for the LLM judge and pass to `make_prompt`. Defaults to 0.
```python
from dataclasses import dataclass
from textwrap import dedent

from autoblocks.testing.evaluators import BaseLLMJudge
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.models import EvaluationOverride
from autoblocks.testing.models import ScoreChoice
from autoblocks.testing.models import Threshold
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class IsFriendly(BaseLLMJudge[TestCase, str]):
    id = "is-friendly"
    threshold = Threshold(gte=1)
    score_choices = [
        ScoreChoice(name="Friendly", value=1),
        ScoreChoice(name="Not friendly", value=0),
    ]

    def make_prompt(
        self,
        test_case: TestCase,
        output: str,
        recent_overrides: list[EvaluationOverride],
    ) -> str:
        return dedent(
            f"""
            Is the output friendly?

            [Output]
            {output}
            """
        ).strip()


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input="I hate you!",
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsFriendly()],
)
```
Automatic Battle
The `AutomaticBattle` evaluator compares the output of your test function to a baseline maintained by Autoblocks. On the first run, the baseline is set to the output of the test function. On subsequent runs, if the evaluator determines that the new output is better, the baseline is updated to that output. If you would like to provide your own baseline, use the `ManualBattle` evaluator instead.

Scores 1 if the challenger wins, 0.5 if it's a tie, and 0 if the baseline wins.
- `criteria` (`str`, required): The criteria the LLM should use when comparing the output to the baseline.
- `output_mapper` (`Callable[[OutputType], str]`, required): Map your output to a string for comparison.
You must set the `OPENAI_API_KEY` environment variable to use the `AutomaticBattle` evaluator.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseAutomaticBattle
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class Battle(BaseAutomaticBattle[TestCase, str]):
    id = "battle"
    criteria = "Choose the best greeting."  # Replace with your own criteria

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Battle()],
)
```
Manual Battle
The `ManualBattle` evaluator compares an output to a baseline based on the criteria you provide. If you would like Autoblocks to automatically manage the baseline, use the `AutomaticBattle` evaluator instead.

Scores 1 if the challenger wins, 0.5 if it's a tie, and 0 if the baseline wins.
- `criteria` (`str`, required): The criteria the LLM should use when comparing the output to the baseline.
- `output_mapper` (`Callable[[OutputType], str]`, required): Map your output to a string for comparison.
- `baseline_mapper` (`Callable[[TestCaseType], str]`, required): Map the baseline ground truth from your test case for comparison.
You must set the `OPENAI_API_KEY` environment variable to use the `ManualBattle` evaluator.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseManualBattle
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)


class Battle(BaseManualBattle[TestCase, str]):
    id = "battle"
    criteria = "Choose the best greeting."  # Replace with your own criteria

    def output_mapper(self, output: str) -> str:
        return output

    def baseline_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hi world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Battle()],
)
```
Accuracy
The `Accuracy` evaluator checks if the output is accurate compared to an expected output. Scores 1 if the output is accurate, 0.5 if somewhat accurate, 0 otherwise.
- `output_mapper` (`Callable[[OutputType], str]`, required): Map the output to a string to pass to the LLM judge.
- `expected_output_mapper` (`Callable[[TestCaseType], str]`, required): Map the test case to the expected output string.
- `model` (`str`, optional): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to `gpt-4-turbo`.
- `num_overrides` (`int`, optional): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
- `example_output_mapper` (`Callable[[EvaluationOverride], str]`, optional): Map an `EvaluationOverride` to a string representation of the output. This gets passed to the LLM judge as an example. If you set `num_overrides` to a non-zero number, this method must be implemented.
- `example_expected_output_mapper` (`Callable[[EvaluationOverride], str]`, optional): Map an `EvaluationOverride` to a string representation of the expected output. This gets passed to the LLM judge as an example. If you set `num_overrides` to a non-zero number, this method must be implemented.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseAccuracy
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)


class Accuracy(BaseAccuracy[TestCase, str]):
    id = "accuracy"

    def output_mapper(self, output: str) -> str:
        return output

    def expected_output_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hello world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Accuracy()],
)
```
NSFW
The `NSFW` evaluator checks if the output is considered safe for work. Scores 1 if the output is safe for work, 0 otherwise.
- `output_mapper` (`Callable[[OutputType], str]`, required): Map the output to a string to pass to the LLM judge.
- `model` (`str`, optional): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to `gpt-4-turbo`.
- `num_overrides` (`int`, optional): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
- `example_output_mapper` (`Callable[[EvaluationOverride], str]`, optional): Map an `EvaluationOverride` to a string representation of the output. This gets passed to the LLM judge as an example. If you set `num_overrides` to a non-zero number, this method must be implemented.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseNSFW
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class NSFW(BaseNSFW[TestCase, str]):
    id = "nsfw"

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input="I hate you!",
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[NSFW()],
)
```
Toxicity
The `Toxicity` evaluator checks if the output contains toxic content. Scores 1 if the output is not toxic, 0 otherwise.
- `output_mapper` (`Callable[[OutputType], str]`, required): Map the output to a string to pass to the LLM judge.
- `model` (`str`, optional): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to `gpt-4-turbo`.
- `num_overrides` (`int`, optional): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
- `example_output_mapper` (`Callable[[EvaluationOverride], str]`, optional): Map an `EvaluationOverride` to a string representation of the output. This gets passed to the LLM judge as an example. If you set `num_overrides` to a non-zero number, this method must be implemented.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseToxicity
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class Toxicity(BaseToxicity[TestCase, str]):
    id = "toxicity"

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input="I hate you!",
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Toxicity()],
)
```
Ragas
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. We have built wrappers around the metrics to make integration with Autoblocks seamless.
Available Ragas evaluators:
- `BaseRagasAnswerCorrectness` evaluates the accuracy of the generated answer compared to the ground truth.
- `BaseRagasAnswerRelevancy` assesses how pertinent the generated answer is to the given prompt.
- `BaseRagasAnswerSemanticSimilarity` evaluates the semantic resemblance between the generated answer and the ground truth.
- `BaseRagasContextEntitiesRecall` measures the recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in ground_truths alone.
- `BaseRagasContextPrecision` evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not.
- `BaseRagasContextRecall` evaluates the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
- `BaseRagasContextRelevancy` evaluates the relevancy of the retrieved context, calculated based on both the question and contexts.
- `BaseRagasFaithfulness` evaluates the factual consistency of the generated answer against the given context.

The Ragas evaluators are only available in the Python SDK. You must install Ragas (`pip install ragas`) before using these evaluators.
- `question_mapper` (`Callable[[TestCaseType, OutputType], str]`, required): Map your test case or output to the question passed to Ragas.
- `answer_mapper` (`Callable[[OutputType], str]`, required): Map your output to the answer passed to Ragas.
- `contexts_mapper` (`Callable[[TestCaseType, OutputType], list[str]]`, required): Map your test case or output to the contexts passed to Ragas.
- `ground_truth_mapper` (`Callable[[TestCaseType, OutputType], str]`, required): Map your test case or output to the ground truth passed to Ragas.
- `llm` (`Any`, optional): Custom LLM for the evaluation. Read more: https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html
- `embeddings` (`Any`, optional): Custom embeddings model for the evaluation. Read more: https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html
- `threshold` (`Threshold`, optional): The threshold for the evaluation used to determine pass/fail.
You must set the `OPENAI_API_KEY` environment variable or the `llm` and `embeddings` properties to use the Ragas evaluators.
```python
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseRagasAnswerCorrectness
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.models import Threshold
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    question: str
    expected_answer: str

    def hash(self) -> str:
        return md5(self.question)


@dataclass
class Output:
    answer: str
    contexts: list[str]


# You can use any of the Ragas evaluators listed here:
# https://docs.autoblocks.ai/testing/offline-evaluations#out-of-box-evaluators-ragas
class AnswerCorrectness(BaseRagasAnswerCorrectness[TestCase, Output]):
    id = "answer-correctness"
    threshold = Threshold(gte=1)

    # You can omit these properties if you'd like to use OpenAI.
    # Just ensure you have the OPENAI_API_KEY environment variable set.
    # If you would like to use a custom LLM and embeddings model,
    # follow the instructions below and set them here.
    # Any LLM: https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html
    # Azure OpenAI: https://docs.ragas.io/en/stable/howtos/customisations/azure-openai.html#configuring-them-for-azure-openai-endpoints
    llm = None
    embeddings = None

    def question_mapper(self, test_case: TestCase, output: Output) -> str:
        return test_case.question

    def answer_mapper(self, output: Output) -> str:
        return output.answer

    def contexts_mapper(self, test_case: TestCase, output: Output) -> list[str]:
        return output.contexts

    def ground_truth_mapper(self, test_case: TestCase, output: Output) -> str:
        return test_case.expected_answer


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            question="How tall is the Eiffel Tower?",
            expected_answer="300 meters",
        )
    ],
    fn=lambda test_case: Output(
        answer="300 meters",
        contexts=["The Eiffel tower stands 300 meters tall."],
    ),
    evaluators=[AnswerCorrectness()],
)
```