Offline Evaluations
AI-powered applications are non-deterministic and require thorough testing to ensure they are trustworthy and reliable.
- How do you gain confidence your AI system will perform as expected in the wild?
- How do you know if a change improves or degrades output quality?
- How do you build conviction that your product is ready to be deployed to production?
Frequent and rigorous testing and evaluation of your AI product is critical to answering these questions.
Our Testing SDKs empower developers to define and execute tests seamlessly, locally or in a CI/CD pipeline. Tests can be standalone scripts or part of a comprehensive test framework.
Additionally, our CLI enables product engineers to rapidly test their products while iterating on any part of their AI system.
Configuring evaluators
For further reading, view full documentation here.
When analyzing the performance of an AI integration, you must evaluate its effectiveness. Evaluations are proxies for AI output quality.
Is it responding professionally? Is it saying anything malicious? Is it responding with factual information? These questions can be answered by defining evaluators, the building blocks for checking if your product behaves as it should.
There are various types of evaluators. Common ones are:
- Rule-based: evaluations for things like formatting, substrings, or character count.
- LLM judges: LLMs evaluating LLM outputs. Meta, right?
Example rule-based evaluator:
from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold


class HasSubstring(BaseTestEvaluator):
    id = "has-substring"

    def evaluate_test_case(self, test_case: MyTestCase, output: str) -> Evaluation:
        score = 1 if test_case.expected_substring in output else 0
        return Evaluation(
            score=score,
            threshold=Threshold(gte=1),
        )
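The MyTestCase type referenced above is whatever test case class you define for your suite. A minimal sketch, assuming a test case with an input and an expected substring (the field names are illustrative):

from dataclasses import dataclass

from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.util import md5


@dataclass
class MyTestCase(BaseTestCase):
    # Hypothetical fields; use whatever inputs and expected values your product needs.
    input: str
    expected_substring: str

    def hash(self) -> str:
        # Uniquely identifies this test case across runs.
        return md5(self.input)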
Example LLM judge evaluator:
from autoblocks.testing.models import BaseTestEvaluator
from autoblocks.testing.models import Evaluation
from autoblocks.testing.models import Threshold


class IsProfessionalTone(BaseTestEvaluator):
    id = "is-professional-tone"

    # Since this evaluator makes calls to an external service (openai),
    # restrict how many evaluations can be made concurrently
    # with this evaluator.
    max_concurrency = 2

    prompt = """Please evaluate the provided text for its professionalism in the context of formal communication.
Consider the following criteria in your assessment:
Tone and Style: Respectful, objective, and appropriately formal tone without bias or excessive emotionality.
Grammar and Punctuation: Correct grammar, punctuation, and capitalization.
Based on these criteria, provide a binary response where:
0 indicates the text does not maintain a professional tone.
1 indicates the text maintains a professional tone.
No further explanation or summary is required; just provide the number that represents your assessment.
"""

    async def _score_content(self, content: str) -> int:
        ...  # Details omitted: send self.prompt and the content to an LLM, then parse the 0/1 response.

    async def evaluate_test_case(
        self,
        test_case: MyTestCase,
        output: str,
    ) -> Evaluation:
        score = await self._score_content(output)
        return Evaluation(score=score)
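The omitted _score_content helper is where the call to your LLM provider would live. A minimal sketch, assuming the openai Python package and an OPENAI_API_KEY environment variable; the helper name and model choice are illustrative, not part of the Autoblocks SDK:

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment.


async def score_content(prompt: str, content: str) -> int:
    """Ask the judge model for a 0/1 score using the evaluator's prompt."""
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": content},
        ],
    )
    raw = (response.choices[0].message.content or "").strip()
    # Fall back to 0 if the model returns anything other than a bare "1".
    return 1 if raw.startswith("1") else 0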
- View example JavaScript offline evaluators on GitHub
- View example Python offline evaluators on GitHub
Set up your test suite
After defining your evaluators, you can use them in a test suite to validate the behavior of your AI product. The Autoblocks Testing SDK can be used to create a test that integrates with Autoblocks.
For further reading, view full documentation here.
from autoblocks.testing.run import run_test_suite

# import your evaluators, test cases, and test function

run_test_suite(
    id="my-test-suite",
    test_cases=gen_test_cases(),
    evaluators=[IsProfessionalTone(), HasSubstring()],
    fn=test_fn,
)
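The gen_test_cases and test_fn referenced above stand in for your own code. A minimal sketch, reusing the illustrative MyTestCase from earlier; in practice test_fn would call into your AI system and return its output:

def gen_test_cases() -> list[MyTestCase]:
    # In practice these might be loaded from a dataset or generated programmatically.
    return [
        MyTestCase(input="Write a greeting", expected_substring="hello"),
        MyTestCase(input="Write a farewell", expected_substring="goodbye"),
    ]


def test_fn(test_case: MyTestCase) -> str:
    # Replace this stub with a call into your AI system; the return value
    # is the output passed to each evaluator.
    return f"hello and goodbye, responding to: {test_case.input}"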
View example test suites on GitHub.
Test suites can be executed through our CLI:
# Assumes you've saved the above code in a file called run.py
npx autoblocks testing exec -m "my first run" -- python3 run.py
Test suites can also be executed in your CI/CD pipeline, allowing you to automate the testing process and trigger test runs from the Autoblocks UI. Learn more about running tests in CI.
Out-of-box evaluators
Autoblocks provides a set of evaluators that can be used out of the box. These evaluators are designed to be easily integrated into your test suite and can help you get started with testing your AI-powered applications.
Each evaluator below lists the custom properties and methods that need to be implemented to use the evaluator in your test suite.
You must set the id property, which is a unique identifier for the evaluator.
For more details on additional properties that can be set on evaluators, such as max_concurrency, see the BaseTestEvaluator documentation.
All of the code snippets can be run by following the instructions in the Quick Start guide.
Logic Based
- Is Equals
- Is Valid JSON
- Has All Substrings
- Assertions (Rubric/Rules)
LLM Judges
- LLM Judge
- Automatic Battle
- Manual Battle
- Accuracy
- NSFW
- Toxicity
Ragas
- LLM Context Precision With Reference
- Non LLM Context Precision With Reference
- LLM Context Recall
- Non LLM Context Recall
- Context Entities Recall
- Noise Sensitivity
- Response Relevancy
- Faithfulness
- Factual Correctness
- Semantic Similarity
Is Equals
The IsEquals evaluator checks if the expected output equals the actual output.
Scores 1 if equal, 0 otherwise.
- test_case_mapper (Callable[[BaseTestCase], str], required): Map your test case to a string for comparison.
- output_mapper (Callable[[OutputType], str], required): Map your output to a string for comparison.
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseIsEquals
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)


class IsEquals(BaseIsEquals[TestCase, str]):
    id = "is-equals"

    def test_case_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hello world",
        ),
        TestCase(
            input="hi world",
            expected_output="hello world",
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsEquals()],
)
Is Valid JSON
The IsValidJSON evaluator checks if the output is valid JSON.
Scores 1 if it is valid, 0 otherwise.
- output_mapper (Callable[[OutputType], str], required): Map your output to the string that you want to check is valid JSON.
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseIsValidJSON
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class IsValidJSON(BaseIsValidJSON[TestCase, str]):
    id = "is-valid-json"

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
        ),
        TestCase(
            input='{"hello": "world"}',
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsValidJSON()],
)
Has All Substrings
The HasAllSubstrings evaluator checks if the output contains all the expected substrings.
Scores 1 if all substrings are present, 0 otherwise.
- test_case_mapper (Callable[[BaseTestCase], list[str]], required): Map your test case to a list of strings to check for in the output.
- output_mapper (Callable[[OutputType], str], required): Map your output to a string for comparison.
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseHasAllSubstrings
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_substrings: list[str]

    def hash(self) -> str:
        return md5(self.input)


class HasAllSubstrings(BaseHasAllSubstrings[TestCase, str]):
    id = "has-all-substrings"

    def test_case_mapper(self, test_case: TestCase) -> list[str]:
        return test_case.expected_substrings

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_substrings=["hello", "world"],
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[HasAllSubstrings()],
)
Assertions (Rubric/Rules)
The Assertions evaluator enables you to define a set of assertions or rules that your output must satisfy.
Individual assertions can be marked as not required; if a non-required assertion is not met, the evaluation will still pass.
- evaluate_assertions (Callable[[BaseTestCase, Any], Optional[List[Assertion]]], required): Implement your logic to evaluate the assertions.
from typing import Optional
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseAssertions
from autoblocks.testing.models import Assertion
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCaseCriterion:
    criterion: str
    required: bool


@dataclass
class TestCase(BaseTestCase):
    input: str
    assertions: Optional[list[TestCaseCriterion]] = None

    def hash(self) -> str:
        return md5(self.input)


class AssertionsEvaluator(BaseAssertions[TestCase, str]):
    id = "assertions"

    def evaluate_assertions(self, test_case: TestCase, output: str) -> list[Assertion]:
        if test_case.assertions is None:
            return []
        result = []
        for assertion in test_case.assertions:
            result.append(
                Assertion(
                    criterion=assertion.criterion,
                    passed=assertion.criterion in output,
                    required=assertion.required,
                )
            )
        return result


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            assertions=[
                TestCaseCriterion(criterion="hello", required=True),
                TestCaseCriterion(criterion="world", required=True),
                TestCaseCriterion(criterion="hi", required=False),
            ],
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[AssertionsEvaluator()],
)
LLM Judge
The LLMJudge evaluator makes it easy to implement your own custom evaluator using an LLM as a judge.
- score_choices (list[ScoreChoice], required): The choices for the LLM judge to use when answering.
- make_prompt (Callable[[TestCaseType, OutputType, list[EvaluationOverride]], str], required): The prompt passed to the LLM judge. Should be posed as a question.
- threshold (Threshold): The threshold for the evaluator.
- model (str): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to gpt-4o.
- num_overrides (int): The number of recent evaluation overrides to fetch for the LLM judge and pass to make_prompt. Defaults to 0.
from dataclasses import dataclass
from textwrap import dedent

from autoblocks.testing.evaluators import BaseLLMJudge
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.models import EvaluationOverride
from autoblocks.testing.models import Threshold
from autoblocks.testing.models import ScoreChoice
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class IsFriendly(BaseLLMJudge[TestCase, str]):
    id = "is-friendly"
    threshold = Threshold(gte=1)
    score_choices = [
        ScoreChoice(name="Friendly", value=1),
        ScoreChoice(name="Not friendly", value=0),
    ]

    def make_prompt(self, test_case: TestCase, output: str, recent_overrides: list[EvaluationOverride]) -> str:
        return dedent(
            f"""
            Is the output friendly?
            [Output]
            {output}
            """
        ).strip()


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input='I hate you!',
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[IsFriendly()],
)
Automatic Battle
The AutomaticBattle evaluator compares the output of your test function to a baseline maintained by Autoblocks.
On the first run, the baseline will be set to the output of the test function.
On subsequent runs, if the evaluator determines that the new output is better, the baseline will be updated to that output.
If you would like to provide your own baseline, use the ManualBattle evaluator instead.
Scores 1 if the challenger wins, 0.5 if it's a tie, and 0 if the baseline wins.
- criteria (str, required): The criteria the LLM should use when comparing the output to the baseline.
- output_mapper (Callable[[OutputType], str], required): Map your output to a string for comparison.
You must set the OPENAI_API_KEY environment variable to use the AutomaticBattle evaluator.
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseAutomaticBattle
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class Battle(BaseAutomaticBattle[TestCase, str]):
    id = "battle"
    criteria = "Choose the best greeting."  # Replace with your own criteria

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Battle()],
)
Manual Battle
The ManualBattle evaluator compares an output to a baseline based on the criteria you provide.
If you would like Autoblocks to automatically manage the baseline, use the AutomaticBattle evaluator instead.
Scores 1 if the challenger wins, 0.5 if it's a tie, and 0 if the baseline wins.
- criteria (str, required): The criteria the LLM should use when comparing the output to the baseline.
- output_mapper (Callable[[OutputType], str], required): Map your output to a string for comparison.
- baseline_mapper (Callable[[TestCaseType], str], required): Map the baseline ground truth from your test case for comparison.
You must set the OPENAI_API_KEY environment variable to use the ManualBattle evaluator.
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseManualBattle
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)


class Battle(BaseManualBattle[TestCase, str]):
    id = "battle"
    criteria = "Choose the best greeting."  # Replace with your own criteria

    def output_mapper(self, output: str) -> str:
        return output

    def baseline_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hi world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Battle()],
)
Accuracy
The Accuracy evaluator checks if the output is accurate compared to an expected output.
Scores 1 if the output is accurate, 0.5 if somewhat accurate, 0 otherwise.
- output_mapper (Callable[[OutputType], str], required): Map the output to a string to pass to the LLM judge.
- expected_output_mapper (Callable[[TestCaseType], str], required): Map the test case to the expected output string.
- model (str): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to gpt-4o.
- num_overrides (int): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
- example_output_mapper (Callable[[EvaluationOverride], str]): Map an EvaluationOverride to a string representation of the output. This gets passed to the LLM judge as an example. If you set num_overrides to a non-zero number, this method must be implemented.
- example_expected_output_mapper (Callable[[EvaluationOverride], str]): Map an EvaluationOverride to a string representation of the expected output. This gets passed to the LLM judge as an example. If you set num_overrides to a non-zero number, this method must be implemented.
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseAccuracy
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str
    expected_output: str

    def hash(self) -> str:
        return md5(self.input)


class Accuracy(BaseAccuracy[TestCase, str]):
    id = "accuracy"

    def output_mapper(self, output: str) -> str:
        return output

    def expected_output_mapper(self, test_case: TestCase) -> str:
        return test_case.expected_output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="hello world",
            expected_output="hello world",
        )
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Accuracy()],
)
NSFW
The NSFW evaluator checks if the output is considered safe for work.
Scores 1 if the output is safe for work, 0 otherwise.
- output_mapper (Callable[[OutputType], str], required): Map the output to a string to pass to the LLM judge.
- model (str): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to gpt-4o.
- num_overrides (int): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
- example_output_mapper (Callable[[EvaluationOverride], str]): Map an EvaluationOverride to a string representation of the output. This gets passed to the LLM judge as an example. If you set num_overrides to a non-zero number, this method must be implemented.
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseNSFW
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class NSFW(BaseNSFW[TestCase, str]):
    id = "nsfw"

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input='I hate you!',
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[NSFW()],
)
Toxicity
The Toxicity evaluator checks if the output contains toxic content.
Scores 1 if the output is not toxic, 0 otherwise.
- output_mapper (Callable[[OutputType], str], required): Map the output to a string to pass to the LLM judge.
- model (str): The model to use for the LLM judge. It must be an OpenAI model that supports tools. Defaults to gpt-4o.
- num_overrides (int): The number of recent evaluation overrides to use as examples for the LLM judge. Defaults to 0.
- example_output_mapper (Callable[[EvaluationOverride], str]): Map an EvaluationOverride to a string representation of the output. This gets passed to the LLM judge as an example. If you set num_overrides to a non-zero number, this method must be implemented.
from dataclasses import dataclass

from autoblocks.testing.evaluators import BaseToxicity
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    input: str

    def hash(self) -> str:
        return md5(self.input)


class Toxicity(BaseToxicity[TestCase, str]):
    id = "toxicity"

    def output_mapper(self, output: str) -> str:
        return output


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            input="Hi how are you?",
        ),
        TestCase(
            input='I hate you!',
        ),
    ],
    fn=lambda test_case: test_case.input,
    evaluators=[Toxicity()],
)
Ragas
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. We have built wrappers around the metrics to make integration with Autoblocks seamless.
Available Ragas evaluators:
- BaseRagasLLMContextPrecisionWithReference: uses an LLM to measure the proportion of relevant chunks in the retrieved_contexts.
- BaseRagasNonLLMContextPrecisionWithReference: measures the proportion of relevant chunks in the retrieved_contexts without using an LLM.
- BaseRagasLLMContextRecall: evaluates the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
- BaseRagasNonLLMContextRecall: uses non-LLM string comparison metrics to identify whether a retrieved context is relevant or not.
- BaseRagasContextEntitiesRecall: measures the recall of the retrieved context based on the number of entities present in both ground_truths and contexts relative to the number of entities present in the ground_truths alone.
- BaseRagasNoiseSensitivity: measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.
- BaseRagasResponseRelevancy: focuses on assessing how pertinent the generated answer is to the given prompt.
- BaseRagasFaithfulness: measures the factual consistency of the generated answer against the given context.
- BaseRagasFactualCorrectness: compares and evaluates the factual accuracy of the generated response with the reference, determining the extent to which the generated response aligns with the reference.
- BaseRagasSemanticSimilarity: measures the semantic resemblance between the generated answer and the ground truth.
The Ragas evaluators are only available in the Python SDK. You must install Ragas (pip install ragas) before using these evaluators.
Our wrappers require at least version 0.2.* of Ragas.
- id (str, required): The unique identifier for the evaluator.
- threshold (Threshold): The threshold for the evaluation, used to determine pass/fail.
- llm (Any): Custom LLM for the evaluation. Required for any Ragas evaluator that uses an LLM. Read more: https://docs.ragas.io/en/stable/howtos/customizations/customize_models/
- embeddings (Any): Custom embeddings model for the evaluation. Required for any Ragas evaluator that uses embeddings. Read more: https://docs.ragas.io/en/stable/howtos/customizations/customize_models/
- mode (str): Only applicable to the BaseRagasFactualCorrectness evaluator. Read more: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness
- atomicity (str): Only applicable to the BaseRagasFactualCorrectness evaluator. Read more: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness
- focus (str): Only applicable to the BaseRagasNoiseSensitivity and BaseRagasFaithfulness evaluators. Read more: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/noise_sensitivity and https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness
- user_input_mapper (Callable[[TestCaseType, OutputType], str]): Map your test case or output to the user input passed to Ragas.
- response_mapper (Callable[[OutputType], str]): Map your output to the response passed to Ragas.
- reference_mapper (Callable[[TestCaseType], str]): Map your test case to the reference passed to Ragas.
- retrieved_contexts_mapper (Callable[[TestCaseType, OutputType], list[str]]): Map your test case and output to the retrieved contexts passed to Ragas.
- reference_contexts_mapper (Callable[[TestCaseType], str]): Map your test case to the reference contexts passed to Ragas.
Individual Ragas evaluators require different parameters. You can find sample implementations for each of the Ragas evaluators here.
from dataclasses import dataclass

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper  # type: ignore[import-untyped]
from ragas.llms import LangchainLLMWrapper  # type: ignore[import-untyped]

from autoblocks.testing.evaluators import BaseRagasResponseRelevancy
from autoblocks.testing.models import BaseTestCase
from autoblocks.testing.models import Threshold
from autoblocks.testing.run import run_test_suite
from autoblocks.testing.util import md5


@dataclass
class TestCase(BaseTestCase):
    question: str
    expected_answer: str

    def hash(self) -> str:
        return md5(self.question)


@dataclass
class Output:
    answer: str
    contexts: list[str]


# You can use any of the Ragas evaluators listed here:
# https://docs.autoblocks.ai/testing/offline-evaluations#out-of-box-evaluators-ragas
class ResponseRelevancy(BaseRagasResponseRelevancy[TestCase, Output]):
    id = "response-relevancy"
    threshold = Threshold(gte=1)
    llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
    embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

    def user_input_mapper(self, test_case: TestCase, output: Output) -> str:
        return test_case.question

    def response_mapper(self, output: Output) -> str:
        return output.answer

    def retrieved_contexts_mapper(self, test_case: TestCase, output: Output) -> list[str]:
        return output.contexts


run_test_suite(
    id="my-test-suite",
    test_cases=[
        TestCase(
            question="How tall is the Eiffel Tower?",
            expected_answer="300 meters",
        )
    ],
    fn=lambda test_case: Output(
        answer="300 meters",
        contexts=["The Eiffel tower stands 300 meters tall."],
    ),
    evaluators=[ResponseRelevancy()],
)